
Detect transport executors with no remaining threads #11503

Merged: 8 commits into grpc:master on Sep 16, 2024

Conversation

@shivaspeaks (Member) commented Sep 3, 2024

Created a way to detect when there are not enough threads for the transport to read and write simultaneously once it starts.

Fixes #11271

@kannanjgithub previously approved these changes Sep 9, 2024
@@ -499,6 +499,29 @@ public Runnable start(Listener listener) {
outboundFlow = new OutboundFlowController(this, frameWriter);
}
final CountDownLatch latch = new CountDownLatch(1);
// This runs con-concurrently with handshake and works as a hack checking enough threads are
Member:
s/con-concurrently/concurrently/

Member:
It doesn't check enough threads are available to start the transport. Maybe something closer to:

The transport needs up to two threads to function once started, but only needs one during handshaking. Start another thread during handshaking to make sure there's still a free thread available. If the number of threads is exhausted, it is better to kill the transport than for all the transports to hang unable to send.
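
Below is a minimal, self-contained sketch of the probe idea the suggested comment describes; it is illustrative only, not the code from this PR, and the class and method names are assumptions. While the handshake occupies one executor thread, submit a trivial task and see whether a second thread ever picks it up; a timeout means the executor has no spare thread.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of gRPC): probes for a second free executor thread
// while the handshake holds the first one.
final class ThreadAvailabilityProbe {
  static boolean secondThreadAvailable(Executor executor, long timeout, TimeUnit unit)
      throws InterruptedException {
    CountDownLatch probeRan = new CountDownLatch(1);
    executor.execute(probeRan::countDown); // only runs if a spare thread exists
    return probeRan.await(timeout, unit);  // false => executor exhausted
  }
}

If the probe never runs, the transport is better off killing itself than starting and later leaving every transport unable to send.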

@kannanjgithub kannanjgithub dismissed their stale review September 11, 2024 05:28

More changes suggested in review.

}
System.out.println("started goaway");
return;
}
latch.await();
Contributor:
I would like to understand the reason for this pre-existing latch.
As it seems, this code waits for the sendConnectionPrefaceAndSettings execution, which happens below.
Can that be moved above this serializingExecutor.execute(new Runnable() ...)?
Have I missed anything?

Member:

Writes are performed by AsyncSink. When a thread wants to do a write, it is added to a queue and a Runnable is added to serializingExecutor. So the Runnable here is running on that "same thread" and the writes can't happen until this proceeds.

The question I had a bit earlier yesterday was, "why don't we do sendConnectionPrefaceAndSettings() in this runnable, instead of waiting on the latch?" This construction guarantees that the first thing written after the TCP/TLS handshake is the HTTP/2 handshake. Back when this code was written, RPCs would be sent on transports before the transport went READY, so sendConnectionPrefaceAndSettings() needed to be enqueued before start() returned.
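
For context, here is a rough sketch of the ordering property this explanation relies on; it is an assumed structure, not gRPC's actual SerializingExecutor or AsyncSink. Tasks run one at a time, in FIFO order, on a thread borrowed from a delegate executor, so write tasks enqueued later cannot run until the handshake Runnable enqueued first has finished.

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

// Sketch of a serializing executor: strictly FIFO, one task at a time, running on a
// thread borrowed from the delegate. A Runnable enqueued first blocks later writes.
final class SerializingExecutorSketch implements Executor {
  private final Executor delegate;
  private final Queue<Runnable> queue = new ArrayDeque<>();
  private boolean draining;

  SerializingExecutorSketch(Executor delegate) {
    this.delegate = delegate;
  }

  @Override
  public synchronized void execute(Runnable task) {
    queue.add(task);
    if (!draining) {
      draining = true;
      delegate.execute(this::drain); // borrow one thread to drain the queue
    }
  }

  private void drain() {
    while (true) {
      Runnable next;
      synchronized (this) {
        next = queue.poll();
        if (next == null) {
          draining = false;
          return;
        }
      }
      next.run(); // tasks execute strictly in submission order
    }
  }
}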

Member:

The introduction of DelayedClientTransport avoided using transports before they were READY. TransportSet is known as InternalSubchannel today. The activeTransportFuture.set() right after start() was the main case that sent RPCs to transports before they were ready.
cf787bd#diff-c993808318f59c8a206c5b4f1af4fd2b3f81a0fca3662be8966c392e3829e430R200

I feel like it took some years to be confident we had gotten rid of all the places that assumed a transport could support RPCs immediately after being started. But that was the most important change in that direction. The change was prompted by repeated memory leaks when dealing with ListenableFutures, as you can't remove listeners from futures. And in the process we fixed assigning RPCs too eagerly to transports.

(There had been some debate at the time whether we were okay with the added latency. Waiting for transport ready actually delays RPCs from being sent on a new connection by an RTT, because it waits to receive HTTP/2 SETTINGS from the server.)

@panchenko (Contributor)

Is another runnable actually needed here? The existing ClientFrameHandler can be started earlier, for example like this:

@@ -164,7 +167,7 @@
   private final ScheduledExecutorService scheduler;
   private final int maxMessageSize;
   private int connectionUnacknowledgedBytesRead;
-  private ClientFrameHandler clientFrameHandler;
+  private final ClientFrameHandler clientFrameHandler = new ClientFrameHandler();
   // Caution: Not synchronized, new value can only be safely read after the connection is complete.
   private Attributes attributes;
   /**
@@ -574,7 +577,7 @@
           onException(e);
           return;
         } finally {
-          clientFrameHandler = new ClientFrameHandler(variant.newReader(source, true));
+          clientFrameHandler.readerAndStartSignal.add(variant.newReader(source, true));
         }
         synchronized (lock) {
           socket = Preconditions.checkNotNull(sock, "socket");
@@ -591,15 +594,18 @@
       latch.countDown();
     }
 
+    // ClientFrameHandler need to be started after connectionPreface / settings, otherwise it
+    // may send goAway immediately.
+    executor.execute(clientFrameHandler);
+
     serializingExecutor.execute(new Runnable() {
       @Override
       public void run() {
         if (connectingCallback != null) {
           connectingCallback.run();
         }
-        // ClientFrameHandler need to be started after connectionPreface / settings, otherwise it
-        // may send goAway immediately.
-        executor.execute(clientFrameHandler);
+        clientFrameHandler.started.await(10, TimeUnit.SECONDS); // TODO error handling
+        clientFrameHandler.readerAndStartSignal.add("START");
         synchronized (lock) {
           maxConcurrentStreams = Integer.MAX_VALUE;
           startPendingStreams();
@@ -1090,19 +1096,19 @@
 
     private final OkHttpFrameLogger logger =
         new OkHttpFrameLogger(Level.FINE, OkHttpClientTransport.class);
-    FrameReader frameReader;
+    final CountDownLatch started = new CountDownLatch(1);
+    final BlockingQueue<Object> readerAndStartSignal = new ArrayBlockingQueue<>(2);
     boolean firstSettings = true;
 
-    ClientFrameHandler(FrameReader frameReader) {
-      this.frameReader = frameReader;
-    }
-
     @Override
     @SuppressWarnings("Finally")
     public void run() {
+      started.countDown();
       String threadName = Thread.currentThread().getName();
       Thread.currentThread().setName("OkHttpClientTransport");
       try {
+        FrameReader frameReader = (FrameReader) readerAndStartSignal.poll(1, TimeUnit.MINUTES);
+        readerAndStartSignal.poll(1, TimeUnit.MINUTES);
         // Read until the underlying socket closes.
         while (frameReader.nextFrame(this)) {
           if (keepAliveManager != null) {

@ejona86 (Member) commented Sep 12, 2024

@panchenko, it's unclear what you're optimizing for. The error handling looks harder to get right with reusing the reader.

@panchenko (Contributor)

@ejona86 starting an intermediate runnable does not guarantee there will be an available thread for clientFrameHandler, so I am thinking about starting exactly the runnable we need.

The error handling should be similar anyway.

@ejona86 (Member) commented Sep 12, 2024

> starting an intermediate runnable does not guarantee there will be an available thread for clientFrameHandler, so I am thinking about starting exactly the runnable we need.

Oh, the concern isn't clientFrameHandler. If we can't get that thread, then at least the wedged transport is not impacting other transports. Also, we won't be able to receive the initial SETTINGS so the transport won't go READY... I guess we could have a timeout trigger in that case.

The concern is the serializingExecutor as used by AsyncSink. If we have N threads and start N transports, then clientFrameHandler will consume those N threads. But then no transport can actually send anything, as that uses an on-demand thread in serializingExecutor/AsyncSink.
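
A toy demonstration of that exhaustion scenario follows; the setup is assumed and this is not gRPC code. With N threads shared by N transports, the N frame-reader tasks block on their sockets and hold every thread, so a later AsyncSink-style write task queued on the same executor never runs.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// N blocking "readers" consume all N threads; the queued "write" can never run.
public final class ExecutorExhaustionDemo {
  public static void main(String[] args) throws InterruptedException {
    int n = 2; // N threads shared by N transports
    ExecutorService executor = Executors.newFixedThreadPool(n);

    for (int i = 0; i < n; i++) {
      final int transport = i;
      executor.execute(() -> {
        System.out.println("reader for transport " + transport + " is blocked on its socket");
        try {
          Thread.sleep(Long.MAX_VALUE); // stands in for frameReader.nextFrame(...)
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    // The write task is queued, but there is no free thread left to run it.
    executor.execute(() -> System.out.println("this write never runs"));

    TimeUnit.SECONDS.sleep(2);
    System.out.println("write task is still queued; all threads are held by readers");
    executor.shutdownNow(); // interrupts the readers and discards the queued write
  }
}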

@ejona86 (Member) left a comment

When merging, use the "squash" setting and make sure to clean up the commit message (the default when there are multiple commits is a list of the commit messages of your commits, which is essentially never what we'd want).

@ejona86 (Member) commented Sep 13, 2024

@shivaspeaks merged commit 3a6be9c into grpc:master Sep 16, 2024
15 checks passed
Successfully merging this pull request may close this issue: okhttp: detect transport executors with no remaining threads