core: fix the race between channel shutdown and clientCallImpl start #3287

dapengzhang0 · 2017-07-27T17:39:55Z

This PR tries to fix the race described and updated in #1981 by introducing a ReadWriteLock.

Tests will be added if the approach is acceptable.

ejona86

I feel like this PR would be better off handling only the case where deadlineCancellationExecutor rejects execution, and using that rejection to cancel the RPC. That seems simpler, doesn't add an additional race, and improves error handling even when the channel isn't terminated.

I would prefer to fix the race instead of mitigating it; I think fixing #2562 may help us in that regard.

ejona86 · 2017-07-27T18:07:34Z

core/src/main/java/io/grpc/internal/ClientCallImpl.java

+
+        @Override
+        public boolean cancel(boolean mayInterruptIfRunning) {
+          return true;


This should return false if the future is already cancelled.

ejona86 · 2017-07-27T18:13:02Z

core/src/main/java/io/grpc/internal/ClientCallImpl.java

@@ -310,11 +330,52 @@ public void run() {
    }
  }

-  private ScheduledFuture<?> startDeadlineTimer(Deadline deadline) {
+  private ScheduledFuture<?> startDeadlineTimer(final Deadline deadline) {


It looks like we don't use any part of the ScheduledFuture interface. Maybe just return Future<?> and then use a pre-existing Future implementation?

What's the pre-existing Future implementation?

At the very least there's FutureTask and Futures.immediateCancelledFuture().
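As a minimal sketch of the FutureTask suggestion above, the following builds an already-cancelled Future from the JDK alone. The class name CancelledFutures and the method newCancelledFuture are hypothetical and are not part of the PR; the point is only that FutureTask already honors the cancel()/isDone() contract discussed in this review.

```java
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

// Hypothetical helper, not gRPC code: shows that a JDK FutureTask can stand in
// for a hand-written "cancelled future" class.
final class CancelledFutures {
  private CancelledFutures() {}

  /** Returns a Future that is already in the cancelled state. */
  static Future<?> newCancelledFuture() {
    // The task body never runs: the future is cancelled before anything executes it.
    FutureTask<Void> task = new FutureTask<>(() -> null);
    task.cancel(false);
    return task;
  }

  public static void main(String[] args) {
    Future<?> future = newCancelledFuture();
    System.out.println(future.isCancelled()); // true
    System.out.println(future.isDone());      // true: a cancelled future counts as done
    System.out.println(future.cancel(true));  // false: it was already cancelled
  }
}
```

With Guava on the classpath, Futures.immediateCancelledFuture() would serve the same purpose without the helper.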

ejona86 · 2017-07-27T18:15:01Z

core/src/main/java/io/grpc/internal/ClientCallImpl.java

+
+        @Override
+        public boolean isDone() {
+          return false;


This should return true if the future is cancelled.

ejona86 · 2017-07-27T18:17:40Z

core/src/main/java/io/grpc/internal/ClientCallImpl.java

+        }
+      }
+
+      return new CancelledFuture();


I don't feel comfortable silently ignoring the deadline. That seems to be just asking for undiagnosable runaway RPCs. The graceful way to handle this would be to fail the RPC.

The RPC would fail automatically?

The RPC is not guaranteed to fail here. You're hoping that the rejection is related to channel termination. But it could have been that the queue was too long. Or it could be that the user shut down their executor before gRPC terminated.

Now I understand your concern.
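A rough sketch of the "fail the RPC" alternative from this thread, under stated assumptions: DeadlineScheduling, scheduleOrFail, and the failRpc callback are hypothetical names, and the callback merely stands in for cancelling the call with an error status. It is not the ClientCallImpl implementation. The rejection is reported to the caller whatever its cause (terminated channel, full queue, or an executor the user shut down early).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper, not gRPC code.
final class DeadlineScheduling {
  private DeadlineScheduling() {}

  /**
   * Tries to schedule the deadline task. If the timer rejects it, the rejection is
   * handed to failRpc instead of being swallowed, and null is returned.
   */
  static Future<?> scheduleOrFail(
      ScheduledExecutorService timer,
      Runnable onDeadline,
      long delayNanos,
      Consumer<RejectedExecutionException> failRpc) {
    try {
      return timer.schedule(onDeadline, delayNanos, TimeUnit.NANOSECONDS);
    } catch (RejectedExecutionException e) {
      // The deadline cannot be enforced, so the call is failed explicitly.
      failRpc.accept(e);
      return null;
    }
  }

  /** Exercises the rejection path with a timer that has already been shut down. */
  static boolean demoRejectedAfterShutdown() {
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    timer.shutdown();
    boolean[] failed = {false};
    Future<?> future = scheduleOrFail(timer, () -> {}, 1_000_000L, e -> failed[0] = true);
    return failed[0] && future == null;
  }
}
```

The design choice here is that a rejected deadline is treated as an error in its own right, so no RPC can keep running without a deadline timer behind it.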

ejona86 · 2017-07-27T18:19:35Z

core/src/main/java/io/grpc/internal/ClientCallImpl.java

@@ -149,6 +151,26 @@ public void start(final Listener<RespT> observer, Metadata headers) {
    checkNotNull(observer, "observer");
    checkNotNull(headers, "headers");

+    if (deadlineCancellationExecutor == null) {


I'm not excited about having a very different code path during the race vs after (with this case being after). It makes it all the more difficult to find bugs during the race.

I'm okay with removing this code path.

ejona86 · 2017-07-27T18:32:00Z

core/src/main/java/io/grpc/internal/ManagedChannelImpl.java

@@ -554,6 +555,32 @@ public String authority() {
    }
  }

+  private final class DirectFallbackExecutor implements Executor {


This is quite an interesting idea, although coupled with the ClientCallImpl change, it will cause us to use the executor after shutdown more than previously. Previously we would never use the executor after shutdown; only the deadlineCancellationExecutor.

It seems we previously used the executor after shutdown as well: anything inside ClientCallImpl.start() may happen after shutdown.

I thought that with FailingClientTransport it would be directly calling back the application, but I guess it does go through the listener which uses the callExecutor. The usages of callExecutor in start are rarely triggered.

ejona86 · 2017-07-27T21:25:30Z

core/src/main/java/io/grpc/internal/ManagedChannelImpl.java

+            command.run();
+          }
+        });
+      } catch (RejectedExecutionException e) {


Instead of relying on the rejection, can we set executor = null when terminated and then check it here?

In that case we can check terminated directly, but there is still a race.
If we check executor != null and then run with it, there could be an NPE because of the race.

NPE is not a concern:

Executor saved = executor; if (saved == null) { command.run(); } else { saved.execute(command); }

I agree there is still a race between checking executor and running with it, but it is dramatically smaller and controlled by our code instead of the executor.
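The one-liner above can be expanded into a small self-contained sketch. The class NullableExecutorDispatch, its volatile field, and markTerminated() are hypothetical stand-ins for the channel state and are not taken from ManagedChannelImpl. Reading the field once into a local is what rules out the NPE; the window between the null check and execute() remains, as noted.

```java
import java.util.concurrent.Executor;

// Hypothetical illustration of the "save to a local, then check" pattern.
final class NullableExecutorDispatch {
  // Assumed convention for this sketch: null means the channel has terminated.
  private volatile Executor executor;

  NullableExecutorDispatch(Executor executor) {
    this.executor = executor;
  }

  /** Simulates channel termination by clearing the executor. */
  void markTerminated() {
    executor = null;
  }

  void execute(Runnable command) {
    // Read the field exactly once; a concurrent markTerminated() cannot turn
    // the local variable into null after the check below.
    Executor saved = executor;
    if (saved == null) {
      command.run();
    } else {
      saved.execute(command);
    }
  }
}
```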

In my mind the race is calling executor after the channel has terminated. None of these solutions actually fix that. They just mitigate it.

I agree that none of these solutions actually fixes the race. But with the rejection check, the only unresolved race is very benign:
executor.execute() throws a RejectedExecutionException for some unlikely reason before the channel is terminated, the channel has terminated by the time we check it in the catch block, and so we don't rethrow the exception and call command.run() directly instead.

Other cases are handled gracefully.

I disagree. The rejection check continues to allow calls to executor. That's the race. executor may be provided by the application and it's not safe to call after termination. Or at least that's been my assumption.

I could believe the argument that we should always use the executor, even after termination, based on the theory that the application is expecting us to use executor (so it may be "magical") and it's the application's own fault for making an RPC after shutdown (not termination). But in that case we mustn't use direct execution if there was a RejectedExecutionException. I'm not sure how many options we have, but we'd maybe need to propagate the exception to the caller and never call their listener.

But this seems to be neither of these. I'm not sure what API theory this change is based on.

ejona86 · 2017-07-27T21:25:59Z

core/src/main/java/io/grpc/internal/ManagedChannelImpl.java

@@ -535,7 +536,7 @@ public String authority() {
        CallOptions callOptions) {
      Executor executor = callOptions.getExecutor();
      if (executor == null) {
-        executor = ManagedChannelImpl.this.executor;
+        executor = new DirectFallbackExecutor();


Let's not create this every time.
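One way to avoid the per-call allocation is to build the wrapper once and reuse it, sketched below. CallExecutorSelection, executorFor, and DelegatingExecutor are hypothetical names; DelegatingExecutor only stands in for the PR's DirectFallbackExecutor, whose real body is not reproduced here.

```java
import java.util.concurrent.Executor;

// Hypothetical illustration, not ManagedChannelImpl code.
final class CallExecutorSelection {
  // Created once per channel rather than once per call.
  private final Executor defaultExecutor;

  CallExecutorSelection(Executor channelExecutor) {
    this.defaultExecutor = new DelegatingExecutor(channelExecutor);
  }

  /** Returns the per-call executor if one was supplied, else the shared default. */
  Executor executorFor(Executor fromCallOptions) {
    return fromCallOptions != null ? fromCallOptions : defaultExecutor;
  }

  // Placeholder wrapper: simply forwards to the channel executor.
  private static final class DelegatingExecutor implements Executor {
    private final Executor delegate;

    DelegatingExecutor(Executor delegate) {
      this.delegate = delegate;
    }

    @Override
    public void execute(Runnable command) {
      delegate.execute(command);
    }
  }
}
```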

dapengzhang0 · 2018-11-20T23:35:54Z

obsolete

dapengzhang0 requested a review from zhangkun83 July 27, 2017 17:40

dapengzhang0 force-pushed the fix1981 branch from 384de82 to 6f09430 Compare July 27, 2017 17:46

ejona86 requested changes Jul 27, 2017

dapengzhang0 force-pushed the fix1981 branch from 23f90a5 to 666b5ad Compare July 27, 2017 19:17

ejona86 reviewed Jul 27, 2017

dapengzhang0 force-pushed the fix1981 branch 8 times, most recently from c13a434 to 351e384 Compare August 7, 2017 16:43

dapengzhang0 changed the title ~~core: mitigate the race between channel termination and clientCallImpl start~~ core: fix the race between channel termination and clientCallImpl start Aug 7, 2017

dapengzhang0 force-pushed the fix1981 branch 3 times, most recently from 0727240 to 63d7d2f Compare August 7, 2017 21:23

dapengzhang0 changed the title ~~core: fix the race between channel termination and clientCallImpl start~~ core: fix the race between channel shutdown and clientCallImpl start Aug 8, 2017

dapengzhang0 force-pushed the fix1981 branch from f22378f to 2842061 Compare August 14, 2017 23:14

core: fix the race between channel shutdown and clientCallImpl start

d839724

dapengzhang0 force-pushed the fix1981 branch from 2842061 to d839724 Compare August 14, 2017 23:15

dapengzhang0 closed this Nov 20, 2018

lock bot locked as resolved and limited conversation to collaborators Feb 19, 2019
