Consider propagating certain remote exceptions #834

Closed
ellisjoe opened this issue Oct 9, 2018 · 28 comments
@ellisjoe commented Oct 9, 2018

What happened?

Service A calls Service B, which throws a 403 (permission denied) exception. Currently Service A will rethrow that as an Internal Error (see: https://github.com/palantir/conjure-java-runtime/blob/develop/conjure-java-jersey-server/src/main/java/com/palantir/conjure/java/server/jersey/RemoteExceptionMapper.java).

What did you want to happen?

We started mapping RemoteExceptions to Internal Errors here: #801. We'd like to continue to do that, though there seems to be some class of RemoteExceptions that may be acceptable to propagate. An initial proposal would be to propagate 403s directly, rather than remapping.
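
For concreteness, here is a minimal hypothetical sketch of the behaviour under discussion (not the actual RemoteExceptionMapper linked above; it assumes RemoteException exposes the upstream status and error body via getStatus() and getError(), as in conjure-java-runtime-api):

```java
import java.util.Map;

import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;
import javax.ws.rs.ext.ExceptionMapper;

import com.palantir.conjure.java.api.errors.RemoteException;

// Sketch only, not the real mapper.
public final class IllustrativeRemoteExceptionMapper implements ExceptionMapper<RemoteException> {
    @Override
    public Response toResponse(RemoteException exception) {
        if (exception.getStatus() == 403) {
            // Proposed special case: pass the 403 through so the calling user
            // sees the permission error instead of an opaque 500.
            return Response.status(403)
                    .type(MediaType.APPLICATION_JSON)
                    .entity(exception.getError())
                    .build();
        }
        // Behaviour since #801: remap everything else to a generic internal
        // error so clients can't couple to transitive services' errors.
        return Response.status(500)
                .type(MediaType.APPLICATION_JSON)
                .entity(Map.of("errorCode", "INTERNAL", "errorName", "Default:Internal"))
                .build();
    }
}
```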

@pnepywoda

Propagating 403 is pretty important, as it should be displayed to users

@gregakinman

From what I can tell, the decision made in #801 was a lesser-of-two-evils choice between:

  • clients potentially introspecting RemoteExceptions from an upstream server, but whatever error happened potentially being clearer (you might get an UpstreamServer:SpecificError)
  • clients only introspecting RemoteExceptions from the server they're talking to directly, but whatever error happened potentially being obscured if the downstream server doesn't catch the upstream error

Are we sure that the latter situation is better than the former?

Now, every client has to update every time a server adds a new type of error. And before clients update at all, no exceptions get propagated and everything is an opaque internal error. Additionally, you can get situations like the following: a server and a Spark application both live in the same repository and are considered part of the same service, despite being two separate processes. They could share the same error namespace. Previously, the Spark application could throw an error which would then just pass through the server. Now, though, the server will have to catch and rethrow each of these errors to preserve them, which adds a lot of verbosity.
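
To make the verbosity concrete, the server ends up with something like the following sketch (SparkJobService and the SparkApp:DatasetNotFound error name are invented for illustration; the conjure error types are assumed to behave as in conjure-java-runtime-api):

```java
import com.palantir.conjure.java.api.errors.ErrorType;
import com.palantir.conjure.java.api.errors.RemoteException;
import com.palantir.conjure.java.api.errors.ServiceException;

public final class JobResource {
    // The server has to re-declare the error in its own namespace...
    private static final ErrorType DATASET_NOT_FOUND =
            ErrorType.create(ErrorType.Code.NOT_FOUND, "MyServer:DatasetNotFound");

    // Stand-in for the Spark application's generated client interface.
    interface SparkJobService {
        String runJob(String datasetRid);
    }

    private final SparkJobService sparkClient;

    JobResource(SparkJobService sparkClient) {
        this.sparkClient = sparkClient;
    }

    public String runJob(String datasetRid) {
        try {
            return sparkClient.runJob(datasetRid);
        } catch (RemoteException e) {
            // ...and match every upstream error by name, rethrowing it so it
            // stays visible; anything unmatched surfaces as an opaque 500.
            if ("SparkApp:DatasetNotFound".equals(e.getError().errorName())) {
                throw new ServiceException(DATASET_NOT_FOUND);
            }
            throw e;
        }
    }
}
```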

@pnepywoda

There's definitely a certain class of errors you'd want to propagate all the way to the calling user, like NotFound or PermissionDenied. Other than that, we wouldn't want, say, an OutOfDiskSpace database error returned to the user, so anything else would have to be whitelisted if you actually care to keep the error name when displaying it to the user.

@gregakinman

Why not?

@gregakinman

Guess it could seem out of the blue. But it's more informative than "internal"...

@pnepywoda

Because there's nothing the user can do about it.
Maybe a better example is an endpoint which does paging, where the caller passed in a page size larger than is acceptable. In this case you really don't want to show the user an error from the calling service.

@gregakinman

Ok, true. I guess whether the client can do anything about it should be our metric for propagation. So maybe just propagate all 4xx?

@pnepywoda

Unfortunately all the "IllegalArgument" exceptions like "page size too big" are 4xx

@gregakinman

Ok, I currently can't think of any clear non-403s that we should propagate, so let's just go with 403s for now; we can always add more later if they come up.

@diogoholanda

Could we propagate just the status code, but not the actual error?
My main concern is that there will be a bunch of spurious 5xx now. Even if I don't want to inspect other services' exceptions, I still want 4xx to be propagated back as 4xx, not 5xx.

@diogoholanda

Or, even if not the whole error code, at least the error code family. So any 4xx would be converted to a 400, for instance.
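
Roughly what I have in mind, as a minimal sketch (assuming RemoteException exposes the upstream status via getStatus()):

```java
import javax.ws.rs.core.Response;
import javax.ws.rs.ext.ExceptionMapper;

import com.palantir.conjure.java.api.errors.RemoteException;

// Sketch only: collapse any remote 4xx into a bare 400 and everything else
// into a 500, without forwarding the remote error body.
public final class StatusFamilyOnlyMapper implements ExceptionMapper<RemoteException> {
    @Override
    public Response toResponse(RemoteException exception) {
        int status = exception.getStatus();
        return Response.status(status >= 400 && status < 500 ? 400 : 500).build();
    }
}
```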

@ellisjoe

What would you do with the 400? In most of the cases we've discussed there's nothing the user can do to fix the request that was made. E.g.: your service requested too many items from another service. The only time the user (caller of your API) is able to do anything about a remote 4xx is when you're just proxying some values to another service that the user controls. In that case (when you're proxying) it seems reasonable for you to also propagate the expected errors.

The 403 example is one I can buy for propagating, so I'm interested in the other specific errors you'd like propagated @diogoholanda

@j-baker commented Oct 21, 2018

I think @diogoholanda's referring to the fact that across service boundaries, we typically use 4xx vs 5xx differently from the perspective of alerting.

Regardless, I agree with your conclusion. 4xx means 'the client's request was invalid' and across service boundaries it's tricky to infer anything about who made the mistake. If I have browser -> service A -> service B and service B says 'requested batch size too big', the error can only be that of service A (at a minimum they've coupled their API to that of service B).

403 is a reasonable exception to this rule because in our internal architecture, it's 99% of the time a client concern, so the expected false positive rate is low.

@j-baker commented Oct 21, 2018

And, to the original question:

The purpose of the Conjure errors was to provide a way to treat errors as API. If you are changing the types of exception that are being thrown, you are breaking your API. As usual the barrier for this is 'if you upgrade your server and a client of yours breaks, you broke your API'.

A model I've seen used internally (with some success) is that of:

  • Zero exceptions are API by default.
  • If you want an exception to be part of the service contract, you add a test which verifies this. An example might be 'we have a test which verifies that if there was a permissions issue, you actually get a 403 out' (see the sketch after this list).
  • It is then part of your contract, and the test verifies that you can't break it without changing your API.
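
As a rough illustration of such a test (MyService and getSecretDataset below are made-up names; how the client is constructed depends on your test harness):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import com.palantir.conjure.java.api.errors.RemoteException;

public final class PermissionContract {
    // Hypothetical Conjure-generated client interface for the service under test.
    interface MyService {
        String getSecretDataset(String rid);
    }

    // Called from a @Test with a client pointed at a locally running server:
    // pinning the 403 makes it part of the contract, so changing it is an API break.
    static void assertPermissionDeniedIsPartOfTheApi(MyService client) {
        RemoteException e = assertThrows(
                RemoteException.class,
                () -> client.getSecretDataset("rid-the-test-user-cannot-read"));
        assertEquals(403, e.getStatus());
    }

    private PermissionContract() {}
}
```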

@diogoholanda

I think the disagreement in this issue is coming exactly from the treatment being given to Conjure errors. Last time there was a conversation about whether Conjure errors are API or not, the agreement was that they were not. The API guarantees were basically that any endpoint can throw errors, and those errors might contain structured information of some kind.

There were a couple of reasons for that, and some others in my mind:

  • There is no way in Conjure for you to specify which endpoints throw which errors. So there is no way for you to specify such an API and for consumers to know what it is.
  • Even if there was a way to declare that, we would also need a way to ensure that servers implementing such an API actually respect it and don't throw errors they haven't declared.
  • How does one improve their APIs by changing their errors? For instance, do we need to duplicate every endpoint when adding QoS exceptions or when limiting the size of requests?

I am open to revisiting what it would mean to treat Conjure errors as API, but in the current state of affairs I think that propagating the structured information about the errors to users (even if they can't act on it) is a lesser evil than requiring clients to read through the code of their dependencies to figure out which error can be thrown by which endpoint and then decide what to re-define in their own namespace (which I also don't believe is the direction we would like to go).

Should we talk in person about this?


@iamdanfox

I think a lot of the pain here is caused by how services/clients are often relying on Conjure-defined errors for different things (some using them to convey important information in params, some for alerting signal, others just using them for debugging, others displaying them directly to users). To get to something more optimal, I think it might be helpful to write up an actual design document about what we're intending Conjure errors to achieve and how best to use them - then at least we can enumerate all the possible things people are using them for and try to find dedicated solutions, rather than just shoehorning the current errors implementation to fit them.

Obviously #801 should have been included in the 4.1.1 changelog - that was a mistake on my part, so I've amended the release notes. I don't want to revert that PR though, because I think the justification for it still stands (writing client-side code that couples applications to errors from upstream transitive services seems like a bad time).

@gregakinman

OK on the operational note about finding solutions for what people want to do with Conjure errors. Our team uses them for conveying important information to clients on why requests fail (e.g., 403s), so we've pinned 4.1.0 as our dependency until we have a solution that allows us to do that without having to needlessly rethrow every error that has information we'd like to propagate. Let's get to that solution soon so that we can take regular upgrades again. @iamdanfox, who should own this design document?

@pnepywoda

ping @iamdanfox

@pkoenig10 commented Dec 5, 2018

Another downside of this change is that 5xx rate metrics are now significantly less useful.

Previously an increased 5xx rate would indicate that either:

  • Your service is returning more 5xx responses
  • Your dependencies are returning more 5xx responses

Both of these scenarios are indicative of problems occurring somewhere and thus the 5xx rate is a good signal that something is wrong.

But now, 4xx responses are also propagated as 5xx errors. A single 4xx response may trigger a number of 5xx responses in the call chain. So an increased 5xx rate may now also be due to:

  • Your dependencies are returning more 4xx responses

This pollutes the signal in the 5xx rate metric. 4xx responses are often not indicative of problems and instead due to incorrect input or request conflicts, both of which occur during normal operation.

@markelliot commented Dec 5, 2018 via email

@pnepywoda

I think it's fairly common that services pass through user arguments to other services, and those deep 403s are ones you definitely want to propagate.

ellisjoe mentioned this issue Dec 5, 2018
@markelliot

We agreed to propagate 403s as a special case. #904

@pnepywoda

This is a good start, but we really should be doing 400 as well, since many services call others by passing through some portion of the user arguments they were given.

@markelliot

We understand your position, but as discussed on the linked PRs and this issue, this isn't something the project is going to revert. We'd continue to note that if you value forwarding specific errors, your service can absolutely continue to do so, explicitly.

@gregakinman

What's the additional detail that is causing this project to not revert this change despite understanding the position against the change?

This change was made to avoid some sort of badness. The badness in question appears to be clients potentially compiling against errors that are declared by a transitive dependency. How bad is this, really? What examples do we have of bad situations that this is intended to fix? I.e., what clients did this, and how did it result in a bad time, and why is the new world better?

Let's examine the situations before and after this change:

Before this change:

  • clients might compile against errors that are declared by a transitive dependency, breaking abstraction around the direct dependency's implementation of its API
  • errors from transitive dependencies, and all the useful information they contain, are propagated to clients, which makes server code simple, as it just assumes the propagation of any errors from dependencies up to the client. This is how it is with libraries

After this change:

  • clients are incentivized to compile against errors declared by their direct dependency due to increased abstraction around the direct dependency's implementation of its API
  • errors from dependencies must be explicitly propagated to clients, which makes server code complicated, or at best, results in a paradigm where each RPC client is wrapped in a version that try/catches every call and tries to map errors from the dependency to errors declared by the server itself. (This is basically what is being advocated for by "if you value forwarding specific errors your service can absolutely continue to do so, explicitly.") The problem with this is that dependencies can add new error types without a major version bump, which means that that error could go unhandled by the server if the server hasn't been updated to account for it yet, which results in error information being lost.

Let's look at these bullet points individually:

  • Abstraction is a good thing in general, but I'd argue this is not the place. This is a library designed for use by all services in a set of microservices that all compose a cohesive system. Internally, various services include arguments in their API that are not their own concept, thus "leaking" to a client the existence of the service that owns that concept. Transactions, I think, are a prominent example of this. (Note I'm not talking about using another service's objects in one's API - I'm referring to referencing common resources, which is basically the point of the ResourceIdentifier library). This is all a cohesive system.
  • It seems the only way for servers to properly handle this is to essentially surround every call to a dependency with a try/catch (see the sketch after this list). Because there's no way to declare which endpoints throw which errors (and this can also change in a minor or patch version), the most conservative thing is just to catch every error declared by the dependency on every call to it. Even this is not foolproof, since, as mentioned above, dependencies can add new errors without the calling server knowing.
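
Concretely, that wrapping pattern ends up looking something like the sketch below (all names are invented; the conjure error types are assumed to behave as in conjure-java-runtime-api):

```java
import java.util.Map;
import java.util.function.Supplier;

import com.palantir.conjure.java.api.errors.ErrorType;
import com.palantir.conjure.java.api.errors.RemoteException;
import com.palantir.conjure.java.api.errors.ServiceException;

// Sketch of wrapping every RPC call: translate the dependency's errors
// (matched by name) into errors this server declares itself, and downgrade
// anything unrecognised to an internal error.
final class RemappingCalls {
    private static final Map<String, ErrorType> TRANSLATIONS = Map.of(
            "Dependency:DatasetNotFound",
            ErrorType.create(ErrorType.Code.NOT_FOUND, "MyServer:DatasetNotFound"));

    static <T> T remap(Supplier<T> call) {
        try {
            return call.get();
        } catch (RemoteException e) {
            ErrorType translated = TRANSLATIONS.get(e.getError().errorName());
            if (translated != null) {
                throw new ServiceException(translated);
            }
            // An error the dependency added in a minor release falls through
            // here and its information is lost.
            throw new ServiceException(ErrorType.INTERNAL);
        }
    }

    private RemappingCalls() {}
}

// Usage at each call site, e.g.:
//     Dataset dataset = RemappingCalls.remap(() -> catalogClient.getDataset(rid));
```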

I still have not heard a detailed argument as to why the before state is worse than the after state other than:

  • It's never been acceptable to rely on transitive services' specific exceptions in consumer code, and that's what we'd like to ban with the new implementation. (I believe I've addressed above why this is not as bad as the world after this change.)
  • If you are changing the types of exception that are being thrown, you are breaking your API. As usual the barrier for this is 'if you upgrade your server and a client of yours breaks, you broke your API'. (This is not the consensus opinion. Counterexample: @diogoholanda above was under the impression that Conjure errors are not API, as are others. Also, properly allowing this to be the case would involve building something akin to checked exceptions, but for RPC. We don't have that system in place currently, and I don't think this can be the consensus opinion until we do.)

I agree that the before state is not ideal - I'm in favor of @iamdanfox's idea that we fix Conjure errors to make this system make more sense - but I don't think we should regress before we make an improvement. As a result, our team has pinned 4.1.0 (the latest version without this change) in our services. Eventually, we will have to upgrade for some reason or another (and likely with some urgency). I'd really like to resolve this before that time comes. Let's set up some time to discuss this synchronously.

@gregakinman

@uschi2000 - I've forwarded you the calendar invite if you'd like to join the discussion.

palantir deleted a comment from uschi2000 Dec 10, 2018
@vandervecken

An additional argument for rethrowing, looking at the support angle.
Let's say there is a systemic failure somewhere and an error is being thrown from some service relatively deep and far from the user. Suppose 10 users have failures.
In the scenario where generic 500s are what come through, you have 10 different users with 10 different support tickets requesting help for generic something-went-wrong problems.
Now suppose you are automatically rethrowing. All 10 users see some unintelligible (to them) error, but crucially, they all see the same specific error, which means you can direct them all to the same support ticket. This significantly eases the support burden, even though the error itself provides no useful information to the user.
