Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: fix the race between channel shutdown and clientCallImpl start #3287

Closed
wants to merge 1 commit into from

Conversation

dapengzhang0
Copy link
Member

@dapengzhang0 dapengzhang0 commented Jul 27, 2017

This PR tries to fix the race described and updated in #1981 by introducing a ReadWriteLock.

Tests will be added if the approach is acceptable.

Copy link
Member

@ejona86 ejona86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this PR would be better off only handling when deadlineCancellationExecutor rejects execution and use that to cancel the RPC. That seems simpler, doesn't add an additional race, and improves error handling even when the channel isn't terminated.

I would prefer fixes the race instead of mitigating it; I think fixing #2562 may help us in that regard.


@Override
public boolean cancel(boolean mayInterruptIfRunning) {
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should return false if the future is already cancelled.

@@ -310,11 +330,52 @@ public void run() {
}
}

private ScheduledFuture<?> startDeadlineTimer(Deadline deadline) {
private ScheduledFuture<?> startDeadlineTimer(final Deadline deadline) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we don't use any part of the ScheduledFuture interface. Maybe just return Future<?> and then use a pre-existing Future implementation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the pre-existing Future implementation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least there's FutureTask and Futures.immediateCancelledFuture().


@Override
public boolean isDone() {
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should return true if the future is cancelled.

}
}

return new CancelledFuture();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel comfortable silently ignoring the deadline. That seems just asking for undiagnosable run-away RPCs. The graceful way to handle this would be to fail the RPC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RPC would fail automatically?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RPC is not guaranteed to fail here. You're hoping that the rejection is related to channel termination. But it could have been that the queue was too long. Or it could be that the user shut down their executor before gRPC terminated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I got what you concerned.

@@ -149,6 +151,26 @@ public void start(final Listener<RespT> observer, Metadata headers) {
checkNotNull(observer, "observer");
checkNotNull(headers, "headers");

if (deadlineCancellationExecutor == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not excited about having a very different code path during the race vs after (with this case being after). It makes it all the more difficult to find bugs during the race.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with removing this code path.

@@ -554,6 +555,32 @@ public String authority() {
}
}

private final class DirectFallbackExecutor implements Executor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite an interesting idea, although coupled with the ClientCallImpl change, it will cause us to use the executor after shutdown more than previously. Previously we would never use the executor after shutdown; only the deadlineCancellationExecutor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems previously we also would use the executor after shutdown: anything inside ClientCallImpl.start() may be happening after shutdown.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that with FailingClientTransport it would be directly calling back the application, but I guess it does go through the listener which uses the callExecutor. The usages of callExecutor in start are rarely triggered.

command.run();
}
});
} catch (RejectedExecutionException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of relying on the rejection, can we set executor = null when terminated and then check it here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we can check terminated directly, but still a race.
If we check executor =! null, and run with it, there could be a NPE because of the race.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NPE is not a concern:

Executor saved = executor;
if (saved == null) {
  command.run();
} else {
  saved.execute(command);
}

I agree there is still a race between checking executor and running with it, but it is dramatically smaller and controlled by our code instead of the executor.

In my mind the race is calling executor after the channel has terminated. None of these solutions actually fix that. They just mitigate it.

Copy link
Member Author

@dapengzhang0 dapengzhang0 Jul 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that none of these solutions actually fix the race. But with the rejection check, the only unresolved race is very very benign as follows:
The executor.execute() is called and throws a RejectedExecutionException for whatever unlikely reason before the channel is terminated, and channel is terminated when we check it in the catch block, so we don't throw this exception, and call command.run() directly.

Other cases are handled gracefully.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. The rejection check continues to allow calls to executor. That's the race. executor may be provided by the application and it's not safe to call after termination. Or at least that's been my assumption.

I could believe the argument that we should always use the executor, even after termination, based on the theory that the application is expecting us to use executor (so it may be "magical") and it's the application's own fault for making an RPC after shutdown (not termination). But in that case we mustn't use direct if there was a RejectionExecutionException. I'm not sure how many options we have, but we'd maybe need to propagate the exception to the caller and we never call their listener.

But this seems to be neither of these. I'm not sure what API theory this change is based on.

@@ -535,7 +536,7 @@ public String authority() {
CallOptions callOptions) {
Executor executor = callOptions.getExecutor();
if (executor == null) {
executor = ManagedChannelImpl.this.executor;
executor = new DirectFallbackExecutor();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not create this every time.

@dapengzhang0 dapengzhang0 force-pushed the fix1981 branch 8 times, most recently from c13a434 to 351e384 Compare August 7, 2017 16:43
@dapengzhang0 dapengzhang0 changed the title core: mitigate the race between channel termination and clientCallImpl start core: fix the race between channel termination and clientCallImpl start Aug 7, 2017
@dapengzhang0 dapengzhang0 force-pushed the fix1981 branch 3 times, most recently from 0727240 to 63d7d2f Compare August 7, 2017 21:23
@dapengzhang0 dapengzhang0 changed the title core: fix the race between channel termination and clientCallImpl start core: fix the race between channel shutdown and clientCallImpl start Aug 8, 2017
@dapengzhang0
Copy link
Member Author

obsolete

@lock lock bot locked as resolved and limited conversation to collaborators Feb 19, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants