
Handle invalid UTF8 bytes in labels #128

Closed
wants to merge 1 commit into prometheus:master from gocardless:hmac/bad-utf8

Conversation


@hmac hmac commented Jun 10, 2019

This change forces all labels into UTF-8, replacing any invalid byte
sequences. This ensures that we can gracefully handle malformed labels.
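
Roughly, the coercion looks like the following (a simplified sketch of the idea rather than the exact diff; String#scrub stands in for whatever replacement we settle on):

```ruby
# Sketch: coerce a label value to valid UTF-8, replacing any invalid
# byte sequences with the Unicode replacement character.
def coerce_to_utf8(value)
  value.to_s.scrub("\uFFFD")
end

coerce_to_utf8("caf\xC3\xA9") # => "café" (valid UTF-8 passes through untouched)
coerce_to_utf8("bad\xFFbyte") # => "bad�byte" (invalid byte replaced)
```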

@dmagliola @Sinjo could you review?


Signed-off-by: Harry Maclean <harrymaclean@gocardless.com>
Member

@SuperQ SuperQ left a comment


Nice, LGTM.

@coveralls

coveralls commented Jun 10, 2019

Coverage remained the same at 100.0% when pulling 0593ca6 on gocardless:hmac/bad-utf8 into 6ea675e on prometheus:master.

@brian-brazil

In other clients we throw an exception when this happens.

@hmac
Author

hmac commented Jun 10, 2019

@brian-brazil the downside of throwing an exception is that it will cause the exporter to blow up, so we'll serve no metrics for as long as the bad label is present.

I do think it would be useful to know which label is causing a problem, so I was considering logging when we encounter a malformed label. What do you think?

@brian-brazil

It's generally best to make it obvious when there's bad data, rather than subtly failing by providing what is a valid label instead. This is the sort of thing that should be handled outside the client library.

@Sinjo
Member

Sinjo commented Jun 11, 2019

In general I agree with the principle of surfacing errors early and loudly, and it's a principle we largely follow in this library.

What seems a shame is that the impact of a single error is magnified: an entire service stops reporting metrics until the dodgy label is cleared from the registry.

Another option would be to omit the metric, though I can see that being confusing to debug.

In some earlier discussions @dmagliola and I did toy around with the idea of a global error handler as part of the client library configuration. The idea was that people could write a small callback function that receives an exception and sends it off to their preferred exception tracker or logs it.

The original motivation behind it was to give people the option of a softer failure mode than the client library raising an exception (e.g. when label keys don't match the ones declared at the metric definition). In the end, we weren't completely sold that it was a necessary addition to the library, so we deferred the work until we had motivation for it.
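
A minimal sketch of the shape we had in mind (none of these names exist in the gem today; they're purely illustrative):

```ruby
# Illustrative only: neither this accessor nor the handler hook exists in
# the gem; it just sketches the proposed configuration point.
module Prometheus
  module Client
    DEFAULT_ERROR_HANDLER = ->(error) { raise error } # today's behaviour

    class << self
      attr_writer :error_handler

      def error_handler
        @error_handler || DEFAULT_ERROR_HANDLER
      end
    end
  end
end

# A user who wants a softer failure mode could plug in their own callback,
# forwarding errors to their preferred exception tracker or logger:
Prometheus::Client.error_handler = lambda do |error|
  MyErrorTracker.notify(error) # stand-in for whatever tracker/logger you use
end
```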

Perhaps this provides that motivation?

@brian-brazil

An entire service stops reporting metrics until the dodgy label is cleared from the registry.

That'll be caught by up (the scrape health metric), so it should be dealt with promptly.

e.g. when label keys don't match the ones declared at the metric definition

That's a programming error, and an exception is appropriate.

@Sinjo
Member

Sinjo commented Jun 11, 2019

That's a programming error, and an exception is appropriate.

Totally agree, and to be clear that would remain the default behaviour.

The idea behind the configurable error handler, at least when we thought about it in the context of incrementing/setting metrics, is that while it's important to know your instrumentation code is working, you want to minimise its impact on production traffic when it isn't.

This is an approach we've successfully used here at GoCardless, specifically in our logging library. It has a last-ditch rescue that reports an error to our exception tracker if something unexpected goes wrong. The outcome is that we both know about the programming error so we can fix it, and our production traffic doesn't feel the impact.
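
A hypothetical sketch of that pattern (not our actual logging gem; MyErrorTracker and emit_log are stand-in names):

```ruby
require "logger"

LOGGER = Logger.new($stdout)

# Stand-in for an exception tracker such as Sentry or Honeybadger.
module MyErrorTracker
  def self.notify(error)
    LOGGER.error("instrumentation failure: #{error.class}: #{error.message}")
  end
end

# Last-ditch rescue: failures in the logging path are reported,
# never re-raised into the request being served.
def emit_log(event)
  LOGGER.info(event.fetch(:message))
rescue StandardError => e
  MyErrorTracker.notify(e)
end

emit_log(message: "payment created") # logs normally
emit_log(nil)                        # error is captured and reported, not raised
```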

Not saying we have to go that specific route here in this library (e.g. @hmac's approach to this change wouldn't add any motivation for it), but that's the context for that approach.

@lawrencejones
Contributor

We're already filtering label values to escape newlines and backslashes (link).

Is it meaningfully different to replace a bad utf8 character with an accepted placeholder for unrecognised symbols?
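
For concreteness, the two transformations being compared look roughly like this (sketch only, not the gem's actual formatter code):

```ruby
# Escaping a newline in a (valid) label value, as the formatter already does:
"line1\nline2".gsub("\n") { "\\n" }  # => "line1\\nline2"

# Replacing an invalid UTF-8 byte with a placeholder, as this change proposes:
"bad\xFFbyte".scrub("\uFFFD")        # => "bad�byte"
```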

I can't imagine anyone ever wanting their labels to crash the text formatter. I'm also struggling to see how this helps developers build more robust code. The most common situation I see here is:

  • Development uses well formatted inputs, so doesn't trigger the issue
  • Testing as above, no exception raised
  • Put your code in production and an unsanitised input finds its way into a Prometheus metric. Now your exporter fails and you can kiss goodbye to your metrics

I'd be keen to keep the current behaviour if it helped devs catch issues before they get to production, but I don't think that applies in this case. Every time the Prometheus client crashes production code, it damages efforts to encourage good practice around building reliable software, and I'd personally resent it if my metrics failed scraping because a tiny part of a larger app mishandled the encoding of a label.

@brian-brazil

Is it meaningfully different to replace a bad utf8 character with an accepted placeholder for unrecognised symbols?

Yes, one is escaping valid input in line with the format. The other is taking invalid input and guessing.

Other clients throw exceptions in this case.

@lawrencejones
Contributor

Cool, it doesn't feel different to me, but I can see how others might disagree.

Could you help me figure out how we should address the other comments I made @brian-brazil? I understand that other clients don't behave like this, but precedent in other clients doesn't seem like a valid argument to me, and if we stick with that we'll never improve the status quo.

@brian-brazil

What would you suggest we do when data that violates the API contract is passed in?

@lawrencejones
Contributor

By the API contract, do you mean what we expect from a label in Prometheus? I think Chris made some good suggestions about either pushing this out to an error handler in the gem, or logging as appropriate. Anything to allow the library behaviour to gracefully degrade rather than blow up seems more appropriate for an instrumentation library, or at least more useful.

@brian-brazil

By the API contract, do you mean what we expect from a label in Prometheus?

Yes, the contract is that you pass in a UTF-8 value. With any form of munging you potentially cause a label clash, as the output label is a valid value the user could have passed in, so there is no graceful degradation, just second-guessing the user so that things break differently.

@lawrencejones
Contributor

Yeah, I think this is the primary issue with this solution: if you have two labels with differing invalid characters, they may map to the same filtered label. Given that the specification for text format parsing probably doesn't cover this case, you'd get different series depending on the implementation, so I think this particular solution is probably not acceptable.
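
To make the clash concrete (using String#scrub as a stand-in for whatever replacement this PR applies):

```ruby
# Two label values that differ only in their invalid bytes collapse to the
# same filtered value, so their series would collide after replacement:
"user_\xFF".scrub("\uFFFD")                                 # => "user_�"
"user_\xFE".scrub("\uFFFD")                                 # => "user_�"
"user_\xFF".scrub("\uFFFD") == "user_\xFE".scrub("\uFFFD")  # => true
```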

Still trying to focus on not blowing up all the metrics when we generate the text format with a bad encoding, how do you feel about an error handler that can present the error but continue while dropping the metric with the bad label?

@brian-brazil

Given that the specification for text format parsing probably doesn't cover this case, you'd get different series depending on the implementation, so I think this particular solution is probably not acceptable.

That'd be invalid input.

how do you feel about an error handler that can present the error but continue while dropping the metric with the bad label?

Experience with such things elsewhere has shown that that's a very bad idea, as you end up silently dropping arbitrary series on each scrape.

@lawrencejones
Contributor

That'd be invalid input.

That's good to know, as it means the approach in this PR might cause a similar level of error up the chain, which defeats the original intent of reducing the scope of the error.

Experience with such things elsewhere has shown that that's a very bad idea, as you end up silently dropping arbitrary series on each scrape.

Also really useful context. I'd imagine the state you describe, where you unwittingly lose metrics, unaware that your error was silenced, would have consequences as far-reaching as dropping all the series.

I think this might be mitigated if the default error handler raised synchronously, and users could set it to something that made more sense for them if they wished. For the library this patch was meant to help, fluentd, you may never care about errors.

Alternatively, we could have a client mode where incrementing metrics with labels would do this character mapping when identifying the metric to increment, instead of when we render the text format, though of course at this point you could argue this filtering is best done in the app, as you have suggested, Brian.

I think @Sinjo and I can have a conversation on Monday about whether any of these alternatives make sense, or if it's best to leave this as is. Thanks for your feedback though Brian, it was useful if only to rule out this PR and provide surrounding context.

@hmac do you reckon we can close this?

@hmac
Author

hmac commented Jul 8, 2019

No strong objections to closing this, since we no longer have an immediate need for it.

I do somewhat disagree with the objections raised here, though. I think the likelihood of a label clash caused by two labels that differ only by a series of bytes which are themselves an invalid UTF-8 sequence is pretty low, and you'd probably spot it quite fast as they'd appear as foo_bar_⍰. As a user I'd much prefer this munging behaviour (plus a log line telling me I need to fix it) to a totally non-functioning metrics endpoint. Currently the error raised gives no clue as to which label is problematic, so you could spend a lot of time tracking down the issue, all the while having no metrics in production!

@hmac hmac closed this Jul 8, 2019
@lawrencejones
Contributor

I'm not sure it would be correct in all cases for us to run the risk of a label clash, but I fully agree with you that the situation where the metric endpoint dies from one invalid label is wrong. I'm going to continue to think about how we could be more robust to that situation while providing useful feedback to the user about how to fix their labels.

@brian-brazil

This is typically caught at instrumentation time, not exposition time.
