count_values string formatting #8763

Open
tcolgate opened this issue Apr 27, 2021 · 40 comments

Comments

@tcolgate
Contributor

Proposal

It would be useful/consistent if count_values used %g formatting, consistent with the le bucket label in histograms. It is useful to be able to assume the two are consistent. This is of practical use when calculating a true apdex as it permits the application to expose a numeric apdex target value, which we can then translate into a bucket name for matching purposes.

The numeric value is useful for establishing which apdex target value to use in the case where a deployment has changed the target apdex value.

The technique is documented here:

https://medium.com/@tristan_96324/prometheus-apdex-alerting-d17a065e39d0

but is unreliable, and additional regex stages are likely needed to append a .0 to integer values in the case where %g has been used for the bucket labels, but is not used by count_values.

@tcolgate tcolgate changed the title count_values to use %g format count_values to use OpenMetrics string format Apr 27, 2021
@roidelapluie
Member

hello,

I think this is a breaking change in Prometheus, so not doable as-is in Prometheus 2.x. We would need to think about alternatives, like adding an extra "formatting" parameter for the count_values.

I would also need to check what the openmetrics spec for formatting is and what the actual implementation is to see the better solution.

Why do you speak about %g but your proposed code does not use %g ?

@roidelapluie
Member

Your application could maybe also expose http_apdex_target_seconds with the correct label.

@tcolgate
Contributor Author

> hello,
>
> I think this is a breaking change in Prometheus, so not doable as-is in Prometheus 2.x. We would need to think about alternatives, like adding an extra "formatting" parameter for the count_values.
>
> I would also need to check what the openmetrics spec for formatting is and what the actual implementation is to see the better solution.
>
> Why do you speak about %g but your proposed code does not use %g ?

The "%g" terminology came from some previous bug reports around differing bucket le labels in different clients, which I read before I found the actual code. It's also referenced in the OpenMetrics spec:
> The target rendering is equivalent to the default Go rendering of float64 values (i.e. %g), with a .0 appended in case there is no decimal point or exponent to make clear that they are floats.

@roidelapluie
Member

Okay, would not %g render +Inf, NaN, etc. out of the box as we would expect?

@tcolgate
Contributor Author

tcolgate commented Apr 27, 2021

> Your application could maybe also expose http_apdex_target_seconds with the correct label.

The advantage of using a numeric value, instead of a hard-coded label, is that I can use max/min to pick if two different versions of the same app happen to provide different values. (I could expose the metrics with both the value and the bucket, but I'd actually have to recreate the OpenMetrics string code in my own app to know how the thing would be rendered.)

Hopefully it makes sense that my example case here is just an example and that formalising the actual format count_values uses (even if it needs an extra arg or a different function) has some value in itself.

@tcolgate
Contributor Author

> Okay, would not %g render +Inf, NaN, etc. out of the box as we would expect?

The code in the PR is taken from the prom/common repo so does handle rendering +/-Inf and NaN, and enforces the .0 suffix.

@roidelapluie
Member

roidelapluie commented Apr 27, 2021

> The advantage of using a numeric value, instead of a hard-coded label, is that I can use max/min to pick if two different versions of the same app happen to provide different values.

You could use topk(1, http_apdex_target_seconds) without(le)

@tcolgate
Contributor Author

tcolgate commented Apr 27, 2021

> You could use `topk(1, http_apdex_target_seconds) without(le)`

The exact choice of how to pick the highest isn't super relevant (we just use min at present I think). When generating the metric I'd have to generate...

http_apdex_target_seconds{le="5.0"} = 5.0
http_apdex_tollerable_seconds{le="20.0"} = 20.0

And have to be confident that I'm rendering the le label in the same way that the client library does.
That's doable, but not as convenient as being able to just do the maths on http_apdex_target_seconds and convert the value into a label in a stable/reliable way that can match a known well formatted bucket.

Phrased differently...

Currently the format of the label produced by count_values is unspecified. It might be better if it were well specified, and matched other places where the rendering of floats in labels is already well specified.

@roidelapluie
Member

I will ping @juliusv and @beorn7 for their opinion on this.

@beorn7
Member

beorn7 commented Apr 27, 2021

The original sin here is that we (ab-)use label values (which are strings) as numbers (in this case bucket boundaries).

The most striking dissonance here comes from the different formatting conventions in different languages, mostly that Go doesn't append a .0, while Java or Python do. While Prometheus itself is just Go, instrumentation libraries are in different languages. So a Python or Java target would expose a bucket boundary of 5 as a label {le="5.0"}, while a Go target uses {le="5"}. That's only a problem in the text format, as the protobuf format uses actual floats. Prometheus 1.x circumvented the problem by sanitizing le and quantile labels (the latter of summaries) so that only Go formatting of those "numbers in labels" would hit the TSDB. Prometheus 2.x tossed both the protobuf format and the sanitizing. Which led to different formatting conventions ending up in the TSDB.

OpenMetrics tried to address that problem, sadly not by reintroducing sanitation of the format (and, in continuation, keeping open the possibility of a TSDB that can handle those numbers as actual numbers, as Prometheus will do with the new Sparse Histograms), but by requiring the instrumented target to do the "correct" formatting. The latter wouldn't be too bad if the "correct" formatting were actually well defined. But in fact it is not. The OM spec works for integer numbers and certainly covers most cases in practice, but there are numbers that can be correctly represented in up to nine different ways, and the OM spec does not specify which one to pick in those cases. (The usual formatting algorithms and language specs are silent about this. They are all only concerned that the text representation will parse back into exactly the same float, but that isn't enough for our case.)

The conclusion is that OM might work in most practical cases, but it is not a proper solution to the problem. (With something like exponential buckets I could see the above described problem also occurring in practice. Needs to be investigated, but my assumption is that this is not purely academic.)

Another aspect of the problem is that we have maneuvered Prometheus 2.x into a position where it is hard to get out without significant friction of breaking changes. If you switch Go targets to using OM, all your bucket and quantile timeseries change. If we re-introduced sanitizing, all the Java/Python/... time series would change. And in neither case would the price we pay get us to a proper solution (as formatting of numbers with many decimal places will still not be well defined).

We mitigated the situation for histogram_quantile by letting that function internally parse the le label into a float and then only use the resulting value. But if you want to pick a specific bucket, you still have to find workarounds (like using a regexp match?).
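
To illustrate why parsing sidesteps the formatting mismatch, here is a minimal sketch that compares two le label values numerically rather than as strings. The helper `sameBoundary` is hypothetical and is not how histogram_quantile is actually implemented; it only shows the idea.

```go
package main

import (
	"fmt"
	"strconv"
)

// sameBoundary is a hypothetical helper: it reports whether two le label
// values denote the same bucket boundary once parsed as float64.
func sameBoundary(a, b string) (bool, error) {
	x, err := strconv.ParseFloat(a, 64)
	if err != nil {
		return false, err
	}
	y, err := strconv.ParseFloat(b, 64)
	if err != nil {
		return false, err
	}
	return x == y, nil
}

func main() {
	pairs := [][2]string{{"5", "5.0"}, {"1e+06", "1000000"}, {"0.25", "0.5"}}
	for _, p := range pairs {
		same, err := sameBoundary(p[0], p[1])
		fmt.Printf("le=%q vs le=%q: same=%v err=%v\n", p[0], p[1], same, err)
	}
}
```

A plain label matcher such as {le="5"} has no such numeric step, which is why the string form matters there.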

With the label value created by count_values, we have a similar problem. With the same implications. I.e. we could apply the OM formatting, which would break current usage but still doesn't get us all the way to a proper solution. At least, because count_values is just happening inside Prometheus, it always uses the Go formatting. It is as well (or not-well) defined as OM formatting. The formatting matches what the Go instrumentation library uses (if not configured to use OM), and it matches what all le and quantile labels used in Prometheus 1.x. I would say we should not change the current behavior until we have found a solution that is significantly better (i.e. not just replacing one ill-defined string formatting with another ill-defined string formatting).

@tcolgate
Contributor Author

FWIW, I decided to "bite the bullet" and switch to OpenMetrics for our Go apps, and that is what brought me to this. The decision was based entirely on it being the only way to get exemplar support, and I can imagine a lot of people will likely do the same. It is worth a few days of broken heat maps. I've not seen any discussion of introducing exemplars in the non-OM expfmt (though I get if proto comes back, that's an option).

@tcolgate
Contributor Author

I'll crack on with the regexp.

@roidelapluie
Member

roidelapluie commented Apr 27, 2021

It feels like this is easily done (openmetrics -> text format):

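metric_relabel_configs:
# Strip the trailing ".0" from integer-valued le labels; other values do not match and stay unchanged.
- source_labels: [le]
  target_label: le
  regex: '([0-9]+)\.0'
  replacement: '${1}'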

The other way around (text format -> openmetrics):

metric_relabel_configs:
# Append ".0" to integer-valued le labels; other values do not match and stay unchanged.
- source_labels: [le]
  target_label: le
  regex: '([0-9]+)'
  replacement: '${1}.0'
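
For anyone wanting to check what these two rules do to a given label value, here is a small Go sketch that mimics them. It assumes that relabel regexes are fully anchored, which is my understanding of Prometheus relabeling; the function names `toClassicLe` and `toOpenMetricsLe` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// The patterns are wrapped in ^(?:...)$ to mimic the assumed full anchoring
// of relabel regexes, so "0.25" or "+Inf" never match and pass through as-is.
var (
	omIntegerLe      = regexp.MustCompile(`^(?:([0-9]+)\.0)$`)
	classicIntegerLe = regexp.MustCompile(`^(?:([0-9]+))$`)
)

// toClassicLe strips the ".0" suffix (OpenMetrics style -> client_golang style).
func toClassicLe(le string) string {
	return omIntegerLe.ReplaceAllString(le, "${1}")
}

// toOpenMetricsLe appends ".0" (client_golang style -> OpenMetrics style).
func toOpenMetricsLe(le string) string {
	return classicIntegerLe.ReplaceAllString(le, "${1}.0")
}

func main() {
	for _, le := range []string{"5", "5.0", "0.25", "+Inf"} {
		fmt.Printf("%-5s classic=%-5s openmetrics=%s\n", le, toClassicLe(le), toOpenMetricsLe(le))
	}
}
```

Note that neither rule touches values written with an exponent, so very large or very small boundaries may still differ between the two styles.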

@beorn7
Member

beorn7 commented Apr 27, 2021

Cool. Just that it is the other way around, isn't it? The first relabels OM into the old (client_golang) behavior, while the second relabels old (client_golang) behavior into OM. (Note that the old text format didn't specify any particular float format. Prometheus 1.x was sanitizing the floats-as-string anyway.)

@roidelapluie
Member

Thanks, I have updated my comment.

@tcolgate
Contributor Author

Thanks for the tips. I didn't raise this to fix an unresolvable problem, though. If count_values is going to convert numbers to strings, it seems like it should do so in a "known way". I appreciate @beorn7's points, though I would say that there is, so far as I can see, a 1 <-> 1 mapping of the OpenMetrics "%g with trailing .0", and the behaviour should be reproducible by clients, so it seemed like a sensible option. If changing the behaviour is undesirable, it might be worth at least documenting the existing behaviour (as golang %f at least), and possibly mentioning that it is not string matchable against OpenMetrics histogram le labels without regexp.

On a side note, purely out of curiosity, would that le relabeling impact exemplar lookup?

@roidelapluie
Member

Saying that OM uses %g is incorrect. We are using %g in the text format too: https://github.com/prometheus/common/blob/main/expfmt/text_create.go#L449
Somehow %g does not add the .0 and we artificially add it in openmetrics.

@roidelapluie
Member

> On a side note, purely out of curiosity, would that le relabeling impact exemplar lookup?

It should yes.

@tcolgate
Contributor Author

> Saying that OM uses %g is incorrect. We are using %g in the text format too: https://github.com/prometheus/common/blob/main/expfmt/text_create.go#L449
> Somehow %g does not add the .0 and we artificially add it in openmetrics.

I appreciate it does not use %g directly (the PR I've closed included the copied code from client_golang so that the behaviour would match, as mentioned above), the terminology was stuck in my head from reading previous PRs that referred to it as such. It's not terribly relevant to anything if this isn't going ahead.

@beorn7
Member

beorn7 commented Apr 28, 2021

What %g precisely does depends on the language/implementation. In Go, it does not add .0 to floats with an integer value.

@tcolgate
Contributor Author

tcolgate commented Apr 28, 2021

It would be worth OM documenting whatever it expects languages to do, probably without reference to "Go %g".
It would be worth Prom documenting what count_values does, without direct reference to %f (or at least linking to docs).
It might be worth mentioning that they aren't compatible.
People probably aren't tripping over this all that often, I was doing tricksy promql things, and someone that read an article about them happens to suffer through applications with apdex targets over 0.25s, which triggered this problem (I am more fortunate).
@beorn7 I've been following the histogram work closely (have even attempted to use the protobuf implemented stuff on your client_golang for a side project I'm playing with), it's extremely exciting, though it does raise interesting questions about integration with exemplars (which have some of my dev focused colleagues doing backflips of joy).

@beorn7
Member

beorn7 commented Apr 28, 2021

> It would be worth OM documenting whatever it expects languages to do, probably without reference to "Go %g".
> It would be worth Prom documenting what count_values does, without direct reference to %f (or at least linking to docs).
> It might be worth mentioning that they aren't compatible.

OM's formatting behavior is documented in their draft spec, and it's even discussing a number of more subtle issues with number formatting.

Definitely agree that the count_values documentation should mention how it formats the numerical value when it ends up as a string in the label. However, I think we really have to just say "what Go does with %g directive" because it would be quite verbose to explain it completely, and my suspicion is that there are no guarantees that the details won't change (between versions of Go or even between platforms). As I understand it, the only guarantee is that the string-formatted float will parse back into exactly the same binary representation of a float (although this cannot be true for NaN, but that's yet another story).

> it's extremely exciting, though it does raise interesting questions about integration with exemplars (which have some of my dev focused colleagues doing backflips of joy).

Definitely we'll have exemplars in the new histogram world, too. It just won't be "one per bucket" anymore, so probably a separate section in the protocol. But those are details to flesh out later.

@tcolgate
Contributor Author

tcolgate commented Apr 28, 2021

> Definitely agree that the count_values documentation should mention how it formats the numerical value when it ends up as a string in the label. However, I think we really have to just say "what Go does with %g directive" because it would be quite verbose to explain it completely, and my suspicion is that there are no guarantees that the details won't change (between versions of Go or even between platforms). As I understand it, the only guarantee is that the string-formatted float will parse back into exactly the same binary representation of a float (although this cannot be true for NaN, but that's yet another story).

From what I remember of the code, count_values uses %f. (it uses strconv.FormatFloat with f IIRC)

@beorn7
Member

beorn7 commented Apr 28, 2021

I've just checked it out: it uses strconv.FormatFloat(s.V, 'f', -1, 64), which boils down to %f with as many digits as are needed to represent the value uniquely. And that's rather unfortunate. It means that it won't use exponents for very large or very small values, which is different from both OpenMetrics as well as the usual (but not specified) behavior of instrumentation libraries when exposing the classic Prometheus text format. It's using a third format! 😠
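
The difference between the 'f' and 'g' forms is easy to see side by side. The sketch below only calls strconv.FormatFloat with the two verbs; the wrapper names `countValuesStyle` and `goGStyle` are made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// countValuesStyle mirrors the call quoted above: 'f' with precision -1.
func countValuesStyle(v float64) string {
	return strconv.FormatFloat(v, 'f', -1, 64)
}

// goGStyle is the shortest 'g' form, which switches to an exponent for
// very large or very small values.
func goGStyle(v float64) string {
	return strconv.FormatFloat(v, 'g', -1, 64)
}

func main() {
	for _, v := range []float64{5, 0.25, 1e6, 1e-7} {
		fmt.Printf("'f': %-10s 'g': %s\n", countValuesStyle(v), goGStyle(v))
	}
}
```

For values such as 5 or 0.25 the two agree; they diverge once 'g' starts using an exponent.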

Now I'm thinking we might want to add an optional parameter to count_values so that the user can pick a format string (in Go fmt style).

@tcolgate
Contributor Author

tcolgate commented Apr 28, 2021

That sounds like a good idea, though I'm not sure if there's literally a sprintf form that would be OM compatible. A reimplementation of a %.. formatter seems ideal but feels like a large job.
(I'm not OM obsessed, honest, but it would be a shame if any solution couldn't produce something that matches the one well specified bucket label format)

@tcolgate tcolgate reopened this Apr 29, 2021
@tcolgate tcolgate changed the title count_values to use OpenMetrics string format count_values string formatting Apr 29, 2021
@beorn7
Member

beorn7 commented Apr 29, 2021

We could use the Go formatter, but add another verb that would format in OpenMetrics style.

@tcolgate
Contributor Author

Well today I learned of fmt.Formatter, though there is no obvious way to fall back, so this kind of thing doesn't look super performant. Fun though:
https://play.golang.org/p/Loj_KVLky1U

@tcolgate
Contributor Author

tcolgate commented Apr 29, 2021

It's not obvious that the full power of Printf strings makes that much sense (e.g., there's not much point in multiple values), so maybe it'd be better to just allow something like
count_values("le,f.3",....) to allow precision to be set? Then we can use strconv.FormatFloat, with extra stuff done for some other verb. So drop support for padding and the extra flags (not sure they do anything for floats anyway), and just support precision and the regular float formatting verbs.

@beorn7
Member

beorn7 commented Apr 29, 2021

Sounds good to me. We could use o for OpenMetrics.

@tcolgate
Contributor Author

o is already used for octal though, some poor soul might need that.

@beorn7
Member

beorn7 commented Apr 29, 2021

But if we use https://golang.org/pkg/strconv/#FormatFloat , we can stick to the letters allowed there, and there is no o, so we could use it.
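
To make the proposal concrete, here is a rough sketch of how a label argument such as "le,f.3" could be split into a label name, a verb, and a precision, with o standing for OpenMetrics style. Everything here is hypothetical: the syntax is only the idea floated in this thread, the names `labelFormat` and `parseLabelFormat` are invented, and the accepted verbs (e, f, g, o) are an arbitrary subset chosen for the example.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// labelFormat is the parsed form of a hypothetical count_values label argument.
type labelFormat struct {
	Label string
	Verb  byte // one of 'e', 'f', 'g', or 'o' (OpenMetrics style)
	Prec  int  // -1 means "shortest representation"
}

// parseLabelFormat parses "name" or "name,<verb>[.<precision>]".
// Without a format suffix it falls back to 'f' with precision -1.
func parseLabelFormat(arg string) (labelFormat, error) {
	parts := strings.SplitN(arg, ",", 2)
	lf := labelFormat{Label: parts[0], Verb: 'f', Prec: -1}
	if len(parts) == 1 {
		return lf, nil
	}
	spec := parts[1]
	if spec == "" || !strings.ContainsRune("efgo", rune(spec[0])) {
		return lf, fmt.Errorf("unsupported format in %q", arg)
	}
	lf.Verb = spec[0]
	if rest := spec[1:]; rest != "" {
		if rest[0] != '.' {
			return lf, fmt.Errorf("expected '.' before precision in %q", arg)
		}
		p, err := strconv.Atoi(rest[1:])
		if err != nil || p < 0 {
			return lf, fmt.Errorf("invalid precision in %q", arg)
		}
		lf.Prec = p
	}
	return lf, nil
}

// format renders v according to the parsed verb and precision.
func (lf labelFormat) format(v float64) string {
	if lf.Verb != 'o' {
		return strconv.FormatFloat(v, lf.Verb, lf.Prec, 64)
	}
	s := strconv.FormatFloat(v, 'g', lf.Prec, 64)
	if !math.IsNaN(v) && !math.IsInf(v, 0) && !strings.ContainsAny(s, ".e") {
		s += ".0"
	}
	return s
}

func main() {
	for _, arg := range []string{"le", "le,f.3", "le,o"} {
		lf, err := parseLabelFormat(arg)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-8s 5 -> %q, 0.25 -> %q\n", arg, lf.format(5), lf.format(0.25))
	}
}
```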

@tcolgate
Contributor Author

@beorn7 I can have a go at implementing this, but would it be worth discussing on prometheus-dev? The use of "thing,something" seems like the kind of thing that could prove contentious, and I'd rather not start if it's going to get shot down.

@beorn7
Member

beorn7 commented May 3, 2021

Yeah, this sounds like a good idea.

@tcolgate
Contributor Author

tcolgate commented May 4, 2021

Is prometheus-developers@googlegroups.com the right list?

@beorn7
Member

beorn7 commented May 4, 2021

yes.

@tcolgate
Contributor Author

tcolgate commented May 5, 2021

Just for the record, conversation continued here:
https://groups.google.com/g/prometheus-developers/c/1OCdPPqGBuQ

@roidelapluie
Member

It seems that the consensus and direction in the mailing list is to introduce %.

If someone wants to implement Tristan's suggestion, I would welcome it.

@tcolgate
Contributor Author

tcolgate commented Jun 4, 2021

I didn't get much of a consensus vibe, but insofar as anything has support beyond myself and @beorn7, I think a %2g style suffix on the label argument was the final decision, with the use of o discussed here (not really on list though) for OpenMetrics.
This would probably make a good first issue for someone (if there's a label for such things?), but if no one steps up, I'll have a go (I'm not going to be rushing though, bit busy at the moment).

@roidelapluie
Member

I clarified in the mailing list that I agree, indeed.

@abrenk

abrenk commented Mar 30, 2023

A workaround in recording rules (where you can't use metric_relabel_configs) is to use label_replace(..., "le", "$1.0", "le", "([0-9]+)").


4 participants