-
Notifications
You must be signed in to change notification settings - Fork 150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle invalid UTF8 bytes in labels #128
Conversation
This change forces all labels into UTF-8, replacing any invalid byte sequences. This ensures that we can gracefully handle malformed labels. Signed-off-by: Harry Maclean <harrymaclean@gocardless.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, LGTM.
In other clients we throw an exception when this happens. |
@brian-brazil the downside of throwing an exception is it will cause the exporter to blow up, so we'll serve no metrics for as long as the bad label is present. I do think it would be useful to know what label is causing a problem, so was considering logging if we encounter a malformed label. What do you think? |
It's generally best to make it obvious when there's bad data, rather than subtly failing by providing what is a valid label instead. This is the sort of thing that should be handled outside the client library. |
In general I agree with the principle of surfacing errors early and loudly and it's a principle we largely follow in this library. What seems a shame that the impact of a single error is magnified - an entire service stops reporting metrics until the dodgy label is cleared from the registry. Another option would be to omit the metric, though I can see that being confusing to debug. In some earlier discussions @dmagliola and I did toy around with the idea of a global error handler as part of the client library configuration. The idea being that people could write a small callback function that receives an exception and sends it off to their preferred exception tracker or logs it. The original motivation behind it was to give people the option of a softer failure mode than the client library raising an exception (e.g. when label keys don't match the ones declared at the metric definition). In the end, we weren't completely sold that it was a necessary addition to the library, so we deferred the work until we had motivation for it. Perhaps this is provides that motivation? |
That'll be caught by
That's a programming error, and an exception is appropriate. |
Totally agree, and to be clear that would remain the default behaviour. The idea behind the configurable error handler, at least when we thought about it in the context of incrementing/setting metrics, is that while it's important to know your instrumentation code is working, you want to minimise its impact on production traffic when it isn't. This is an approach we've successfully used here at GoCardless - specifically in our logging library. It has a last ditch Not saying we have to go that specific route here in this library (e.g. @hmac's approach to this change wouldn't add any motivation for it), but that's the context for that approach. |
We're already filtering label values to escape newlines and backslashes link Is it meaningfully different to replace a bad utf8 character with an accepted placeholder for unrecognised symbols? I can't imagine anyone ever wanting their labels to crash the text formatter. I'm also struggling to see how this helps developers build more robust code. The most common situation I see here is:
I'd be keen to keep the current behaviour if it helped devs catch issues before they get to production, but I don't think that applies in this case. Every time the Prometheus client crashes production code damages efforts to encourage good practice around building reliable software, and I'd personally resent it if my metrics failed scraping because a tiny part of a larger app mishandled the encoding of a label. |
Yes, one is escaping valid input in line with the format. The other is taking invalid input and guessing. Other clients throw exceptions in this case. |
Cool, it doesn't feel different to me, but i can see how others might disagree. Could you help me figure out how we should address the other comments I made @brian-brazil? I understand that other clients don't behave like this but it being in other clients doesn't seem a valid argument to me, and if we stick with that we'll never improve the status quo. |
What would you suggest we do when data that violates the API contract is passed in? |
By the API contract, do you mean what we expect from a label in Prometheus? I think Chris made some good suggestions about either pushing this out to an error handler in the gem, or logging as appropriate. Anything to allow the library behaviour to gracefully degrade rather than blow up seems more appropriate for an instrumentation library, or at least more useful. |
Yes, the contract is that you pass in a UTF-8 value. By any form of munging you potentially cause a label clash, as the output label is a valid value the user could have passed in - so there is no graceful degradation, just second-guessing the user so that things break differently. |
Yeah I think this is the primary issue with this solution- if you have two labels with differing invalid characters, they may map to the same filtered label. Given the specification for text format parsing probably doesn't cover this case, depending on the implementation you'd get different series, so I think this particular solution is probably not acceptable. Still trying to focus on not blowing up all the metrics when we generate the text format with a bad encoding, how do you feel about an error handler that can present the error but continue while dropping the metric with the bad label? |
That'd be invalid input.
Experience with such things elsewhere has shown that that's a very bad idea, as you end up silently dropping arbitrary series on each scrape. |
That's good to know, as it means the approach in this PR might cause a similar level of error up the chain, which goes against the intent which was to reduce the scope of the error.
Also really useful context, and I'd imagine this state you describe- where you unwittingly lost metrics, unaware that your error was silenced- would have equally far reaching consequences as dropping all the series. I think this might be mitigated if the default error handler would synchronously raise, and users could set this to something that made more sense for them if they wished. For the library this patch was meant to help, fluentd, you may never care about errors. Alternatively, we could have a client mode where incrementing labels with metrics would do this character mapping when identifying the metric to increment instead of when we render the text format, though of course at this point you could argue this filtering is best done in the app, as you have suggested Brian. I think @Sinjo and I can have a conversation on Monday about whether any of these alternatives make sense, or if it's best to leave this as is. Thanks for your feedback though Brian, it was useful if only to rule out this PR and provide surrounding context. @hmac do you reckon we can close this? |
No strong objections to closing this, since we no longer have an immediate need for it. I do somewhat disagree with the objections raised here, though. I think the likelihood of label clash caused by two labels that differ only by a series of bytes which are themselves an invalid UTF-8 sequence is pretty low, and you'd probably spot it quite fast as they'd appear as |
I'm not sure it would be correct in all cases for us to run the risk of a label clash, but I fully agree with you that the situation where the metric endpoint dies from one invalid label is wrong. I'm going to continue to think about how we could be more robust to that situation while providing useful feedback to the user about how to fix their labels. |
This is typically caught at instrumentation time, not exposition time. |
This change forces all labels into UTF-8, replacing any invalid byte
sequences. This ensures that we can gracefully handle malformed labels.
@dmagliola @Sinjo could you review?