Catch exceptions thrown by Executor.execute #636

Open
ejona86 opened this issue Jul 16, 2015 · 10 comments

ejona86 (Member) commented Jul 16, 2015

The direct executor lets RuntimeExceptions pass through the call stack. We need to defend against that in places where we would permit the direct executor. Even without the direct executor, execute() can throw a RejectedExecutionException, so it is really a case we should handle.
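
To illustrate the rejection case (a self-contained example, not grpc-java code): any Executor is permitted to refuse work, so an execute() call made on the application's behalf can throw.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {
  public static void main(String[] args) {
    // One thread, no queue: any task that cannot run immediately is rejected.
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
        1, 1, 0, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
    executor.execute(() -> {
      try {
        Thread.sleep(1000); // occupy the only worker thread
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    try {
      executor.execute(() -> System.out.println("callback")); // rejected
    } catch (RejectedExecutionException e) {
      // If the execute() call had been made by gRPC to deliver a callback,
      // this exception would currently propagate up the call stack.
      System.out.println("rejected: " + e);
    }
    executor.shutdown();
  }
}
```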

@ejona86 ejona86 added the bug label Jul 16, 2015
@ejona86 ejona86 added this to the 0.8.0 milestone Jul 16, 2015
@ejona86 ejona86 modified the milestones: Beta (0.9.0), 0.8.0 Aug 14, 2015
@ejona86 ejona86 added this to the 1.0 milestone Dec 1, 2015
@ejona86 ejona86 modified the milestones: 0.11.0, 1.0 Jan 11, 2016
@zhangkun83 zhangkun83 modified the milestones: 1.0, 0.13.0 Jan 29, 2016
@hsaliak hsaliak modified the milestones: 1.1, 1.0 Apr 26, 2016
@hsaliak hsaliak added the P2 label Apr 26, 2016
@ejona86 ejona86 self-assigned this Aug 23, 2016
@ejona86 ejona86 modified the milestones: Next, 1.1 Jan 6, 2017

ejona86 (Member, Author) commented Mar 5, 2017

The most basic handling would be catching the exception and using it to cancel the call. However, this will appear to the application as if such calls are "leaked," because the cancellation likely wouldn't be delivered to the application (and even the Context wouldn't be cancelled).

But we need to support things like RejectedExecutionException better (primarily on the server side, but it can happen on the client side, too). This is a common mechanism to deal with overload. Ideally, grpc-java would have easy configuration for limiting the resources used, so that the application wouldn't need to configure the executor to reject tasks. Such configuration has the benefit of allowing gRPC to inform remote clients to stop/slow sending new RPCs (via MAX_CONCURRENT_STREAMS), or similar. But even if grpc-java has nice configuration here, it should still support RejectedExecutionException cleanly, because it is application-wide (not just the grpc server) and simply because executors are permitted to reject Runnables.

On the server, if the Runnable that calls startCall() fails to be scheduled, then simply logging is probably appropriate, because the application wouldn't be aware it existed [1].

But for the other cases, we may want a second, "fallback" executor. It could be a single thread, or it could be the shared executor we typically use (probably in concert with SerializingExecutor to make sure it doesn't consume too many resources). The application should be allowed to override it, but separately from the default executor() override used today. When an execute() throws, the "real" executor used by the call's SerializingExecutor would be changed to the fallback executor, and then the call would be cancelled.
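
A minimal sketch of that idea (names invented here, not grpc-java internals): once the application's executor rejects a task, route everything for this call to a fallback executor that is known never to reject, and cancel the call.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper that a call's SerializingExecutor could delegate to.
final class FallbackOnRejectionExecutor implements Executor {
  private final Executor primary;   // the application-provided executor
  private final Executor fallback;  // e.g. a shared executor that never rejects
  private final AtomicBoolean usingFallback = new AtomicBoolean();

  FallbackOnRejectionExecutor(Executor primary, Executor fallback) {
    this.primary = primary;
    this.fallback = fallback;
  }

  @Override
  public void execute(Runnable task) {
    if (!usingFallback.get()) {
      try {
        primary.execute(task);
        return;
      } catch (RejectedExecutionException e) {
        // Switch this call over to the fallback; in the design above, the
        // caller would also cancel the call at this point.
        usingFallback.set(true);
      }
    }
    fallback.execute(task);
  }
}
```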

Unfortunately, if it is the onClose()/onComplete()/onCancel() callback whose scheduling fails in the first place, then cancelling the call won't trigger another callback. So this probably needs to be handled as a special case.

We need to also make sure that we close any message InputStreams.

  1. This isn't quite true in the case of direct executor. But in that case a future "scheduling" of the cancellation Runnable would succeed, because it will run on the same thread.

njhill (Contributor) commented Jun 15, 2020

Included another suggestion for this here: #7106 (comment)

ejona86 (Member, Author) commented Jul 7, 2020

(Responding here to keep the design ideas in one place.)

@njhill wrote:

Thanks @ejona86, I see that the more general part of this is a bit tricky. I wonder if a fallback executor could be avoided. There are two realistic cases I think - temporary thread/queue exhaustion and (permanent) premature executor shutdown.

The latter case constitutes an app bug where all bets are off since the channel is then "broken" at least for async usage. It seems reasonable in this case to just log the error and not run any call close callbacks.

To accommodate the former case, you could schedule a repeating task to re-attempt scheduling of the SerializingExecutor onto the user's executor at some interval (like 2sec), after cancelling the call and ensuring to run any non-callback cleanup logic inline (including releasing messages after an onMessage is rejected). This has advantages of not requiring a new executor, and of callbacks not being made from an unexpected executor which could have other implications for the app.


To accommodate the former case, you could schedule a repeating task to re-attempt scheduling of the SerializingExecutor onto the user's executor at some interval (like 2sec)

I don't feel comfortable with a polling approach. At the very least that would increase load when already overloaded. I guess we could schedule a single Runnable in the executor that could call all the callbacks (so there'd be one 2-second timer, independent of the number of streams), which would reduce the cost if rejection is still occurring. But I don't feel like this would notify the application quickly enough, since it is likely overloaded.

Maybe if there is a rejection we could try to run that one notify-of-cancellation Runnable every time we try to schedule onto the executor, until it succeeds. That would work better. There'd be some synchronization required. I'm still sort of bothered by the delay of the cancellation notification, as I can think of some easy pathological cases, but it's probably "good enough" as a default. (Assuming someone asks for it,) I guess we could use a fallback executor only if one is provided, and otherwise use this scheme.
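
A rough sketch of that scheme (names invented, and whether to drop the other queued work after a rejection is a separate design question): after a rejection, each later attempt to schedule onto the executor retries a single notify-of-cancellation Runnable until it is accepted.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;

final class RetryCancellationExecutor implements Executor {
  private final Executor delegate;
  private final Runnable notifyCancellation;
  private boolean rejected;          // guarded by 'this'
  private boolean cancellationSent;  // guarded by 'this'

  RetryCancellationExecutor(Executor delegate, Runnable notifyCancellation) {
    this.delegate = delegate;
    this.notifyCancellation = notifyCancellation;
  }

  @Override
  public synchronized void execute(Runnable task) {
    if (!rejected) {
      try {
        delegate.execute(task);
        return;
      } catch (RejectedExecutionException e) {
        rejected = true;  // from now on, only try to deliver the cancellation
      }
    }
    if (!cancellationSent) {
      try {
        delegate.execute(notifyCancellation);
        cancellationSent = true;
      } catch (RejectedExecutionException e) {
        // Still overloaded; the next execute() attempt will retry.
      }
    }
  }
}
```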

garrison-stauffer commented Apr 19, 2022

Hey, thanks for logging/tracking this issue! We discovered this is happening for our own service today and it's causing us to leak some resources in streaming calls, which causes our servers to overwhelm themselves in a sort of feedback loop (we are under load => close calls get rejected => we fail to clean up resources that keep competing for the thread pool).

We're wondering if it's workable to use ThreadPoolExecutor#CallerRunsPolicy to ensure that onClose()/onCompletion() calls are propagated to our observer so that we can clean up resources. It's not perfect, because we also use the executor for our coroutine calls when they finish executing, but we think it's best that, if this error starts happening, we're at least able to see it and try to clean up from there.

Would that be an okay workaround?
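
For concreteness, a sketch of the executor configuration being asked about (pool and queue sizes are placeholders, not recommendations): with CallerRunsPolicy, a task that would otherwise be rejected runs on the thread that called execute(), which for gRPC callbacks is often a network thread, so the callbacks must not block.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class CallerRunsExecutorConfig {
  static ThreadPoolExecutor newAppExecutor() {
    return new ThreadPoolExecutor(
        4, 4, 60, TimeUnit.SECONDS,
        new LinkedBlockingQueue<Runnable>(1000),
        // Instead of throwing RejectedExecutionException, run the task on
        // the thread that called execute().
        new ThreadPoolExecutor.CallerRunsPolicy());
  }
}
```

The resulting executor would then be passed to ServerBuilder.executor(...).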

juliojgd commented Jun 7, 2022

@ejona86 Any update on the prioritization of this issue? It was opened more than 7 years ago and it is categorized as a bug.

Due to this issue, executions rejected by the executor with a RejectedExecutionException can't be detected through the current gRPC event handler methods, so we can't perform the cleanup and metrics recording we need. Basically, in that case we cannot do anything. The CALLER_RUNS_POLICY workaround does not work for us, as (for this case) it is a futile attempt to pass the responsibility to another layer, where we do not have the information needed to complete the tasks we have to perform.

I think that leaving this issue unsolved creates a resilience problem for gRPC Java: there is no way to properly manage the tasks we are unable to run when the executor's resources are exhausted.

@garrison-stauffer Using ThreadPoolExecutor#CallerRunsPolicy only transfers the problem to another layer (the caller's). If in that layer you are able to perform your required actions (cleanup, observability measures, etc.), I think it's the only workaround you have as of today.

ejona86 (Member, Author) commented Jun 8, 2022

When I created the issue I thought this would have been a bigger deal. But I think it turned out not to be for a couple reasons.

  1. It is mostly addressed in a similar way to the joke: "You tell your doctor that it hurts when you do X. The doctor tells you to stop doing X." If your executor doesn't reject, you don't have (this) problem.
  2. For those truly concerned with limiting resources, different services on the same port frequently need different limits. So configuring the single executor used by all of them only solves a simple form of the problem.
  3. Most memory/resources for an RPC are generally in the application, so not informing applications that RPCs are dead doesn't help much. gRPC must have an executor to run callbacks (and I agree that CALLER_RUNS_POLICY is of no help).
  4. Thankfully, there are extremely few users of direct executor.

Instead of rejecting runnables, I think the normal solution is to use an interceptor that fails RPCs. (E.g., it can count the number of concurrent RPCs to a service and, when it exceeds a limit, fail the RPC and even report info to the client on when to retry.)
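
A hypothetical sketch of such an interceptor (the class name and the limit of 100 are invented for illustration; this is not a grpc-java built-in):

```java
import io.grpc.ForwardingServerCallListener.SimpleForwardingServerCallListener;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import io.grpc.Status;
import java.util.concurrent.atomic.AtomicInteger;

final class ConcurrencyLimitInterceptor implements ServerInterceptor {
  private static final int MAX_CONCURRENT = 100;
  private final AtomicInteger inFlight = new AtomicInteger();

  @Override
  public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
      ServerCall<ReqT, RespT> call, Metadata headers,
      ServerCallHandler<ReqT, RespT> next) {
    if (inFlight.incrementAndGet() > MAX_CONCURRENT) {
      inFlight.decrementAndGet();
      // Fail the RPC instead of letting the executor reject its Runnables.
      call.close(Status.UNAVAILABLE.withDescription("server overloaded"), new Metadata());
      return new ServerCall.Listener<ReqT>() {};
    }
    return new SimpleForwardingServerCallListener<ReqT>(next.startCall(call, headers)) {
      @Override
      public void onComplete() {
        inFlight.decrementAndGet();
        super.onComplete();
      }

      @Override
      public void onCancel() {
        inFlight.decrementAndGet();
        super.onCancel();
      }
    };
  }
}
```

It could be attached with ServerInterceptors.intercept(service, new ConcurrencyLimitInterceptor()). The executor is then sized generously and never configured to reject; overload is handled by failing RPCs here instead.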

Some later comments sketch solutions for point 3. But points 1 and 2 still come into play and you can tell that very few people have complained about this. (I've discussed it in person with a reasonable number of people so they certainly notice and are disappointed, but point 3 morphs the conversation.) Point 2 was eroded by #8266. And I have some hopes to erode it further by letting you configure Executors on a per-service basis. But that's also because I want to get to a point where it is more useful to fix this.

juliojgd commented Jun 9, 2022

Hello @ejona86, thanks for your answer. Some clarifications about your points (and about our needs):

Regarding point 1: an executor can reject tasks, and that is normal behavior. I think that if grpc-java uses an executor, it should support the executor's normal behavior, and rejection is part of that normal behavior (normal == the API defines it). So handling that situation is needed, and the lack of such handling should be considered a flaw IMHO. This comment applies both to the direct executor and to any other configured executor.

We do use an interceptor that fails RPCs. But for task-limiting purposes we need a counter of in-flight tasks: we increase it when a task starts and decrease it when the task ends. The only case where the counter never gets decreased is when there is a rejection in the executor, because we cannot find a place in the grpc-java interfaces where we can put our counter-decrement code. In such cases the counter remains un-decremented, which makes it useless, since it keeps a wrong count from the moment at least one rejection occurs during the life of the JVM.

For us, having a point in grpc-java where we can place our counter-decrement logic in every case (including rejections from executors) is crucial. Not having it makes the whole feature useless (the counter, and therefore the task-limiting feature), as (we all agree) CALLER_RUNS_POLICY is no solution here.

You could tell me: always enable your task limiter and you will never get a rejection from the executor.

It's a good point, but our task limiter can start disabled and then be enabled at runtime, when you see some anomalous metric; maybe rejections because the sizing was too low for the peak load being received. If we had this feature, our counter would correctly count the in-flight tasks and the rate limiter would be able to drop RPCs and then recover. But without it, the counter has already missed many decrement operations, which is unrecoverable. Its value will be wrongly high (due to the decrements missed because of the rejections).

I hope I've explained our need and why we feel that having a situation that is not covered by any existing hook leaves loose ends, and that those loose ends in our case invalidate a whole feature.

ejona86 (Member, Author) commented Jun 9, 2022

So handling that situation is needed, and the lack of such handling should be considered a flaw IMHO

There's literally no good way to handle it in the general case. From what I've seen, most Java code will leak like a sieve if it gets a rejection. With async code, there simply isn't a way to handle it safely. For example, consider ListenableFuture, which just logs, or ForkJoinPool, which doesn't let you specify a limit. (ForkJoinPool will throw RejectedExecutionException, but only if it is shut down or hits an OutOfMemoryError, or in some concurrency-related cases which prove the point but are too complex to discuss here.)

The only way to handle this is variations on the idea that somewhere there's an executor that will actually accept the Runnable. We might allow one executor to reject, but have a "backup" executor that won't reject.

For us, to have a point in grpc-java where to place our decrease-counter logic in every case (including rejections from executors) is crucial.

Exactly. You are telling me that some executor must accept the Runnable. The Runnable must be run. Period. But that's the opposite of rejection and demonstrates why rejection, while being a nice idea, is pretty hard to make use of in practice in async environments.

Rejection makes sense for simpler scenarios where unit-of-work == Runnable. In that case, rejecting the Runnable rejects the entire unit of work. Think servlets where requests are tied to a single thread and you use blocking I/O. But for async it falls apart.

A hard-line "the API says" won't get you anywhere. I am sympathetic to your cause but that approach is a dead end. I can simply say "rejected exceptions are unsupported" or that gRPC "handles" the rejection via crashing and leaking memory. Certainly, there could be no guarantee that your callback is executed. This is more of a give-and-take and not really a place for an API lawyer, as there is no constraint that says combining various APIs is guaranteed to produce a useful result. API lawyering can just help you predict the result.

(FWIW, RejectedExecutionException is a RuntimeException which many people consider "a programmer error", aka bug, if they are thrown. This is a correct interpretation if you get a rejection because the executor was shut down, for example. I do understand things aren't that black-and-white, but it is at least interesting that it isn't a checked exception.)

It's a good point but our task limiter can start disabled

"Do whatever you do for the Executor that is rejecting." If you start disabled, that's fine. Don't have the executor throw then. And then if the limit is changed you start failing new RPCs until enough old RPCs complete that you can accept new ones.

juliojgd commented

Thanks for your answer.

There's literally no good way to handle it in the general case.

Totally agree on this.

The only way to handle this is variations on the idea that somewhere there's an executor that will actually accept the Runnable. We might allow one executor to reject, but have a "backup" executor that won't reject.

Yes, I think the same.

There is no way to make sure that the "closing" tasks are executed independently of the "user" tasks, right?? (Current state of the code base)
I mean there is no way to ensure the execution of the "closing" tasks if you don't ensure the execution of "non-closing" tasks.

What we would need to achieve is that the closing tasks get executed. We saw a "cancellationExecutor" in the "ServerImpl" code but, aside from it being the same as the "non-cancellation" executor (only not wrapped), I think it does not run all the "closing" tasks, only certain types of them. This is only to confirm with you, because you know the code deeply (and I don't 😢).

Once this is confirmed, I think we'll adopt the approach you advised: "never reject" a task that is sent to the gRPC executor, to ensure all the tasks ("closing" and "non-closing" alike) are run.

Thanks in advance for this final clarification.

ejona86 (Member, Author) commented Jun 16, 2022

There is no way to make sure that the "closing" tasks are executed independently of the "user" tasks, right??

Correct. Because all the tasks that matter here are user tasks, which call the application-provided listener.

I mean there is no way to ensure the execution of the "closing" tasks if you don't ensure the execution of "non-closing" tasks.

Correct. And that's really the core of this issue.

To be honest, one of the things holding this up is that whatever solution we decide on here, we're probably stuck with it forever. Adding an executor tends to be disruptive.

We saw a "cancellationExecutor" in "ServerImpl" code

cancelExecutor. That is a bit different. That is for us to cancel the Context. We schedule that separately so that if the application is blocking it will still run. We don't trust users enough to cancel the Context directly from the network thread. (Cancellation listeners there provide their own executor, but people too often use directExecutor() without considering the consequences.)
