
Handle invalid UTF8 bytes in labels #128

Closed
wants to merge 1 commit into prometheus:master from gocardless:hmac/bad-utf8

Conversation


@hmac hmac commented Jun 10, 2019

This change forces all labels into UTF-8, replacing any invalid byte
sequences. This ensures that we can gracefully handle malformed labels.
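
Roughly, the coercion looks like the following (a simplified sketch of the idea rather than the exact diff; String#scrub stands in for whatever replacement we settle on):

```ruby
# Sketch: coerce a label value to valid UTF-8, replacing any invalid
# byte sequences with the Unicode replacement character.
def coerce_to_utf8(value)
  value.to_s.scrub("\uFFFD")
end

coerce_to_utf8("caf\xC3\xA9") # => "café" (valid UTF-8 passes through untouched)
coerce_to_utf8("bad\xFFbyte") # => "bad�byte" (invalid byte replaced)
```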

@dmagliola @Sinjo could you review?


Signed-off-by: Harry Maclean <harrymaclean@gocardless.com>
Member

@SuperQ SuperQ left a comment


Nice, LGTM.

@coveralls

coveralls commented Jun 10, 2019

Coverage remained the same at 100.0% when pulling 0593ca6 on gocardless:hmac/bad-utf8 into 6ea675e on prometheus:master.

@brian-brazil

In other clients we throw an exception when this happens.

@hmac
Author

hmac commented Jun 10, 2019

@brian-brazil the downside of throwing an exception is that it will cause the exporter to blow up, so we'll serve no metrics for as long as the bad label is present.

I do think it would be useful to know which label is causing a problem, so I was considering logging when we encounter a malformed label. What do you think?

@brian-brazil

It's generally best to make it obvious when there's bad data, rather than subtly failing by providing what is a valid label instead. This is the sort of thing that should be handled outside the client library.

@Sinjo
Member

Sinjo commented Jun 11, 2019

In general I agree with the principle of surfacing errors early and loudly, and it's a principle we largely follow in this library.

What seems a shame is that the impact of a single error is magnified: an entire service stops reporting metrics until the dodgy label is cleared from the registry.

Another option would be to omit the metric, though I can see that being confusing to debug.

In some earlier discussions @dmagliola and I did toy around with the idea of a global error handler as part of the client library configuration. The idea was that people could write a small callback function that receives an exception and sends it off to their preferred exception tracker or logs it.

The original motivation behind it was to give people the option of a softer failure mode than the client library raising an exception (e.g. when label keys don't match the ones declared at the metric definition). In the end, we weren't completely sold that it was a necessary addition to the library, so we deferred the work until we had motivation for it.
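
A minimal sketch of the shape we had in mind (none of these names exist in the gem today; they're purely illustrative):

```ruby
# Illustrative only: neither this accessor nor the handler hook exists in
# the gem; it just sketches the proposed configuration point.
module Prometheus
  module Client
    DEFAULT_ERROR_HANDLER = ->(error) { raise error } # today's behaviour

    class << self
      attr_writer :error_handler

      def error_handler
        @error_handler || DEFAULT_ERROR_HANDLER
      end
    end
  end
end

# A user who wants a softer failure mode could plug in their own callback,
# forwarding errors to their preferred exception tracker or logger:
Prometheus::Client.error_handler = lambda do |error|
  MyErrorTracker.notify(error) # stand-in for whatever tracker/logger you use
end
```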

Perhaps this provides that motivation?

@brian-brazil

An entire service stops reporting metrics until the dodgy label is cleared from the registry.

That'll be caught by up (the scrape health metric), so it should be dealt with promptly.

e.g. when label keys don't match the ones declared at the metric definition

That's a programming error, and an exception is appropriate.

@Sinjo
Member

Sinjo commented Jun 11, 2019

That's a programming error, and an exception is appropriate.

Totally agree, and to be clear that would remain the default behaviour.

The idea behind the configurable error handler, at least when we thought about it in the context of incrementing/setting metrics, is that while it's important to know your instrumentation code is working, you want to minimise its impact on production traffic when it isn't.

This is an approach we've successfully used here at GoCardless, specifically in our logging library. It has a last-ditch rescue that reports an error to our exception tracker if something unexpected goes wrong. The outcome is that we both know about the programming error so we can fix it, and our production traffic doesn't feel the impact.
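
A hypothetical sketch of that pattern (not our actual logging gem; MyErrorTracker and emit_log are stand-in names):

```ruby
require "logger"

LOGGER = Logger.new($stdout)

# Stand-in for an exception tracker such as Sentry or Honeybadger.
module MyErrorTracker
  def self.notify(error)
    LOGGER.error("instrumentation failure: #{error.class}: #{error.message}")
  end
end

# Last-ditch rescue: failures in the logging path are reported,
# never re-raised into the request being served.
def emit_log(event)
  LOGGER.info(event.fetch(:message))
rescue StandardError => e
  MyErrorTracker.notify(e)
end

emit_log(message: "payment created") # logs normally
emit_log(nil)                        # error is captured and reported, not raised
```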

Not saying we have to go that specific route here in this library (e.g. @hmac's approach to this change wouldn't add any motivation for it), but that's the context for that approach.

@lawrencejones
Contributor

We're already filtering label values to escape newlines and backslashes (link).

Is it meaningfully different to replace a bad utf8 character with an accepted placeholder for unrecognised symbols?
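
For concreteness, the two transformations being compared look roughly like this (sketch only, not the gem's actual formatter code):

```ruby
# Escaping a newline in a (valid) label value, as the formatter already does:
"line1\nline2".gsub("\n") { "\\n" }  # => "line1\\nline2"

# Replacing an invalid UTF-8 byte with a placeholder, as this change proposes:
"bad\xFFbyte".scrub("\uFFFD")        # => "bad�byte"
```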

I can't imagine anyone ever wanting their labels to crash the text formatter. I'm also struggling to see how this helps developers build more robust code. The most common situation I see here is:

  • Development uses well formatted inputs, so doesn't trigger the issue
  • Testing as above, no exception raised
  • Put your code in production and an unsanitised input finds its way into a Prometheus metric. Now your exporter fails and you can kiss goodbye to your metrics

I'd be keen to keep the current behaviour if it helped devs catch issues before they get to production, but I don't think that applies in this case. Every time the Prometheus client crashes production code, it damages efforts to encourage good practice around building reliable software, and I'd personally resent it if my metrics failed scraping because a tiny part of a larger app mishandled the encoding of a label.

@brian-brazil

Is it meaningfully different to replace a bad utf8 character with an accepted placeholder for unrecognised symbols?

Yes, one is escaping valid input in line with the format. The other is taking invalid input and guessing.

Other clients throw exceptions in this case.

@lawrencejones
Contributor

Cool, it doesn't feel different to me, but I can see how others might disagree.

Could you help me figure out how we should address the other comments I made @brian-brazil? I understand that other clients don't behave like this, but precedent in other clients doesn't seem like a valid argument to me, and if we stick with that we'll never improve the status quo.

@brian-brazil

What would you suggest we do when data that violates the API contract is passed in?

@lawrencejones
Contributor

By the API contract, do you mean what we expect from a label in Prometheus? I think Chris made some good suggestions about either pushing this out to an error handler in the gem, or logging as appropriate. Anything to allow the library behaviour to gracefully degrade rather than blow up seems more appropriate for an instrumentation library, or at least more useful.

@brian-brazil

By the API contract, do you mean what we expect from a label in Prometheus?

Yes, the contract is that you pass in a UTF-8 value. With any form of munging you potentially cause a label clash, as the output label is a valid value the user could have passed in, so there is no graceful degradation, just second-guessing the user so that things break differently.

@lawrencejones
Contributor

Yeah, I think this is the primary issue with this solution: if you have two labels with differing invalid characters, they may map to the same filtered label. Given that the specification for text format parsing probably doesn't cover this case, you'd get different series depending on the implementation, so I think this particular solution is probably not acceptable.
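
To make the clash concrete (using String#scrub as a stand-in for whatever replacement this PR applies):

```ruby
# Two label values that differ only in their invalid bytes collapse to the
# same filtered value, so their series would collide after replacement:
"user_\xFF".scrub("\uFFFD")                                 # => "user_�"
"user_\xFE".scrub("\uFFFD")                                 # => "user_�"
"user_\xFF".scrub("\uFFFD") == "user_\xFE".scrub("\uFFFD")  # => true
```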

Still trying to focus on not blowing up all the metrics when we generate the text format with a bad encoding, how do you feel about an error handler that can present the error but continue while dropping the metric with the bad label?

@brian-brazil

Given that the specification for text format parsing probably doesn't cover this case, you'd get different series depending on the implementation, so I think this particular solution is probably not acceptable.

That'd be invalid input.

how do you feel about an error handler that can present the error but continue while dropping the metric with the bad label?

Experience with such things elsewhere has shown that that's a very bad idea, as you end up silently dropping arbitrary series on each scrape.

@lawrencejones
Contributor

That'd be invalid input.

That's good to know, as it means the approach in this PR might cause a similar level of error up the chain, which defeats the original intent of reducing the scope of the error.

Experience with such things elsewhere has shown that that's a very bad idea, as you end up silently dropping arbitrary series on each scrape.

Also really useful context. I'd imagine the state you describe, where you unwittingly lose metrics, unaware that your error was silenced, would have consequences as far-reaching as dropping all the series.

I think this might be mitigated if the default error handler raised synchronously, and users could set it to something that made more sense for them if they wished. For the library this patch was meant to help, fluentd, you may never care about errors.

Alternatively, we could have a client mode where incrementing metrics with labels would do this character mapping when identifying the metric to increment, instead of when we render the text format, though of course at this point you could argue this filtering is best done in the app, as you have suggested, Brian.

I think @Sinjo and I can have a conversation on Monday about whether any of these alternatives make sense, or if it's best to leave this as is. Thanks for your feedback though Brian, it was useful if only to rule out this PR and provide surrounding context.

@hmac do you reckon we can close this?

@hmac
Author

hmac commented Jul 8, 2019

No strong objections to closing this, since we no longer have an immediate need for it.

I do somewhat disagree with the objections raised here, though. I think the likelihood of a label clash caused by two labels that differ only by a series of bytes which are themselves an invalid UTF-8 sequence is pretty low, and you'd probably spot it quite fast as they'd appear as foo_bar_⍰. As a user I'd much prefer this munging behaviour (plus a log line telling me I need to fix it) to a totally non-functioning metrics endpoint. Currently the error raised gives no clue as to which label is problematic, so you could spend a lot of time tracking down the issue, all the while having no metrics in production!

@hmac hmac closed this Jul 8, 2019
@lawrencejones
Contributor

I'm not sure it would be correct in all cases for us to run the risk of a label clash, but I fully agree with you that the situation where the metric endpoint dies from one invalid label is wrong. I'm going to continue to think about how we could be more robust to that situation while providing useful feedback to the user about how to fix their labels.

@brian-brazil

This is typically caught at instrumentation time, not exposition time.
