Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.
Sign upInconsistent and confusing regex anchoring behavior #1200
Comments
This comment has been minimized.
This comment has been minimized.
|
Option 2. ist roughly what my discussion about this with @beorn7 would have lead me. I don't have an issue with doing anchoring myself and find the wrapping odd. But people form different backgrounds may feel the opposite. In configs there may be the risk of people not anchoring properly out of laziness, which could lead to conflicts later on, though unlikely. As queries are more volatile, I don't feel the same reasoning applies there – certainly not partially for the relabel functions only. |
This comment has been minimized.
This comment has been minimized.
|
FWIW, as part of #996 I also meant to make regex label matchers anchored back then (it's even stated as a goal in the issue), but apparently forgot that part. Up until today, I thought that I had made them anchored too. The intent was to make everything consistent, although that part was forgotten. As for usages within SC, I've seen some wrong ones in the past (assuming anchored matches when they were not), but they are probably fixed due to user education by now. As for 2), wouldn't anchoring some regexes and not others (config) be even more confusing to users than having anchoring everywhere? What I like about anchoring is that if I want to match multiple full label values (common use case), I just have to write Anyways, I trust y'all to do the right thing. Just remember, if you do decide for 2), it'll be a second breaking change to Personally I prefer 4), but that might just be what I like myself. /cc @brian-brazil |
This comment has been minimized.
This comment has been minimized.
There's many that do beginning anchors, the problem is they aren't usually at all documented in this regard. The most obvious would be
I always end up having to explicitly anchor, as otherwise I'd end up having to read the code to determine if this particular interface anchors. Due to the wide inconsistencies in this area I'm for 4) as that's probably what the user wants and had thought selectors already worked this way.
This is different from my experience. Sysadmins tend to know it reasonably well, and developer knowledge tends not to be too great. |
This comment has been minimized.
This comment has been minimized.
Please give examples for this claim. You presented one function in Python where a similar function is also available with partial match behavior. To my knowledge at least, the standard libraries in Java, Go or Ruby, the cmdline tools grep, sed, awk, test or the databases influxdb, graphite, opentsdb do all partial matching. Just to name a few in the related fields.
Maybe time for a survey. I checked a few of the expressions written at SoundCloud and found examples for half anchored expressions ( As I can not see your described "wide inconsistencies", my vote is to not get into a guessing game what a user might want here, but to follow the most common implementation (which seems to be partial matching by far) unless we have evidence that the absolute majority of Prometheus users indeed prefers anchored matching. |
This comment has been minimized.
This comment has been minimized.
For clarity please don't use the term "partial match", that has a completely different meaning when we're talking about text search. All the regexes we're talking about here are fully matched. The correct term is anchoring. In my experience it's a coin toss as whether a given Python program or interface will use unanchored or beginning anchored regexes, so I gave up tracking which was which many years ago. Go does not document if it is anchored (reading between the lines it's unanchored). Java is documented as anchored. Ruby is documented as unanchored. Given that OpenTSDB is written in Java and is backed by Cassandra which is also in Java I'd presume that it's also anchored (the docs give no indication, there's a hint that it's fully anchored). There is no standard here. It boils down to which language(s) you're used to using.
We've had several users get confused by the previously unanchored behaviour of Prometheus configs, particularly with relabelling. Looking at the end result of a code base doesn't tell you about what user expectations are, it just tells you what the user ended up doing after an unknown amount of trial and error. I agree with @juliusv that most users are looking for anchored matches. In the case of properly used labels you're going to almost always want fully anchored matches, and if you think you don't then you probably haven't thought things all the way through. Taking your second example, "{namespace}-{db}-{suffix}" is not a proper use of labels as that should be three separate labels. Due to this suboptimal label choice, you've also got an incorrect regex of If that was three separate labels then Any completely unanchored use of regexes with labels is likely incorrect. You need anchors to be safe. |
This comment has been minimized.
This comment has been minimized.
|
First of all: +1 for consistency. (Good that we clarified that the current inconsistency was an oversight and not a deliberate decision.) WRT "standard" behavior: grep, sed, awk all work line-based, where non-anchored matching makes sense (because you want to find part of a line). Matching label values is a different thing because you are usually interested in the whole label. As a Unix command line counter-example look at Library review:PythonHas C++The popular RE2 library has Boost has Java
ConclusionIt doesn't make sense to talk about "standard" behavior. Libraries either do not acknowledge a standard or call anchored matching just "match" and non-anchored matching something like "find" or "search", implying anchored matching is the more "natural" matching. Command line tools usually do non-anchored matching. As said initially, I first and foremost want consistency. I don't feel too strongly about the choice, but my preference in the context of Prometheus would be full anchoring. |
This comment has been minimized.
This comment has been minimized.
|
Going with a completely unified approach is easiest in the long run. I do think this will be breaking rules for a few people. As one is usually aware of the possible label values, skipping any leading and trailing |
fabxc
added a commit
that referenced
this issue
Nov 5, 2015
fabxc
closed this
in
#1201
Nov 5, 2015
This comment has been minimized.
This comment has been minimized.
|
Alright. Consistency inside of Prometheus is probably the most important thing, so the solution taken makes sense. p.s. @brian-brazil The example was for a |
wrouesnel
pushed a commit
to wrouesnel/prometheus
that referenced
this issue
Jan 5, 2016
This comment has been minimized.
This comment has been minimized.
lock
bot
commented
Mar 24, 2019
|
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
grobie commentedNov 4, 2015
Regex anchoring got introduced by @juliusv and @brian-brazil in #996 in config files as well as for the
label_replace()function in promql.It was not changed anywhere else in promql, so label regex matchers are still not anchored. See this example:
node_boot_time{instance=~"robustperception"}.As I'm not aware of a single linux tool, programming langue or other regex interface which does anchoring (please give examples otherwise), I'm questioning the benefit of anchoring. It's a non-standard behavior, requires additional knowledge and therefore, makes it harder for people to get started with Prometheus. I'm able to write a regex in almost any tool immediately without the need to look up the documentation, but with prometheus it took me several tries to understand why my
label_replacefunction did not work as expected.I heard the argument that anchored regexs are shorter, they're not in all cases (
^4vs.4.*). In general I don't think this should be an argument over consistency with the majority of other regex implementations.I also heard that anchoring makes it harder for people to write wrong regular expressions. I checked tens of expressions using regular matchers at soundcloud and couldn't find a single one using ambiguous regexps. I'd argue that the majority of developers knows how to use regexps and disagree that it's Prometheus' role to prevent the minority of people from doing something stupid.
I see 4 options to solve this issue:
While I personally don't see the benefit of anchoring, my realistic vote is 2. It does not require another disruptive change to configs and provides a clear rule for user: anchoring in config, standard behavior in promql.
@fabxc @beorn7