
channel warm up per thread #1758

Open
LeeYeeze opened this issue May 1, 2016 · 23 comments
LeeYeeze commented May 1, 2016

Hi, all. I'm doing an experiment with gRPC and found that the first use of gRPC in a thread costs much more time than subsequent identical calls. I tested on a 2-core laptop and on 24-core machines, and the same phenomenon occurred on all of them. So I'm wondering whether I'm using gRPC wrong or whether something in grpc-java could be improved to avoid this. Below is my code and part of the report.

import java.text.SimpleDateFormat;
import java.util.Date;

import me.gdbc.mocker.MockServer;
import me.proto.query.QueryProtocol.QueryRequest;
import me.proto.query.QueryProtocol.QueryResponse;

import static me.proto.query.QueryServiceInterfaceGrpc.METHOD_QUERY;

import io.grpc.CallOptions;
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

public class WarmUp {

    public static SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    public static QueryRequest request = QueryRequest.newBuilder().setQueryString("").build();

    public static class QueryThread extends Thread {

        private String name;
        private ManagedChannel client;

        public QueryThread(String name, ManagedChannel client) {
            this.name = name;
            this.client = client;
        }

        public void run() {
            for (int i = 0; i < 300; i++) {
                long start = System.currentTimeMillis();
                try {
                    //ClientCall<QueryRequest, QueryResponse> call = client.newCall(METHOD_QUERY, CallOptions.DEFAULT);
                    QueryResponse ret = io.grpc.stub.ClientCalls.blockingUnaryCall(client, METHOD_QUERY, CallOptions.DEFAULT, request);
                    //QueryResponse ret = io.grpc.stub.ClientCalls.blockingUnaryCall(call, request);
                    long duration = System.currentTimeMillis() - start;
                    String show = "" + ret + "\n" + df.format(new Date()) + "\n" + "cost " + duration + "ms in thread " + name;
                    System.out.println(show);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        int port = 9012;
        MockServer server = new MockServer(port);
        try {
            server.start();
            ManagedChannel channel = NettyChannelBuilder.forAddress("127.0.0.1", port).usePlaintext(true).build();
            io.grpc.stub.ClientCalls.blockingUnaryCall(channel, METHOD_QUERY, CallOptions.DEFAULT, request);
            for (int i = 0; i < 4; i++) {
                QueryThread th = new QueryThread(String.valueOf(i), channel);
                th.start();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
May 01, 2016 2:28:09 PM me.gdbc.mocker.MockServer start
INFO: Server started, listenning on 9012
May 01, 2016 2:28:09 PM io.grpc.internal.ManagedChannelImpl <init>
INFO: [ManagedChannelImpl@326de728] Created with target 127.0.0.1:9012
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 76ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 67ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 68ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 66ms in thread 2
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 8ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 8ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 10ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 10ms in thread 2
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 8ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 10ms in thread 2
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 4ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 3
result {
  props {
    intVal: 1
  }
}
@carl-mastrangelo
Contributor

A few things:

  • You should measure this kind of time with System.nanoTime().
  • The JVM takes a large number of iterations before it JIT-compiles code — as many as 10,000 — which it doesn't look like your test hits.
  • Please run the benchmark code (in benchmarks/...) to get an idea of what a reasonable latency is.
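The first two points above can be sketched as a minimal, self-contained timing harness. This is plain Java with no gRPC dependency; the `Runnable` is a stand-in for the actual RPC, and the warm-up/measured iteration counts are illustrative, not tuned values:

```java
import java.util.Arrays;
import java.util.concurrent.locks.LockSupport;

public class Timing {
    // Measure with System.nanoTime() and discard warm-up iterations so that
    // JIT compilation and other first-use costs do not dominate the numbers.
    static long medianMicros(Runnable task, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long[] samples = new long[measured];
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = (System.nanoTime() - start) / 1_000; // µs
        }
        // Report the median rather than the mean: it is robust to the
        // occasional slow outlier (GC pause, scheduler hiccup).
        Arrays.sort(samples);
        return samples[measured / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload: park ~1 ms, roughly as a local RPC might.
        long m = medianMicros(() -> LockSupport.parkNanos(1_000_000L), 10, 30);
        System.out.println("median ~" + m + "µs");
    }
}
```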


LeeYeeze commented May 2, 2016

Hi Carl. This isn't a benchmark, and I'm not concerned with absolute latency right now. What bothers me is the huge difference between the first gRPC call in a thread and the following ones in the same thread. The first usually takes about 100ms, while subsequent calls take only about 5ms to do the same thing (the difference is clear even without measuring in nanoseconds). And it doesn't seem to be a JIT problem, since the test never reaches the compilation threshold.

@makdharma

Could it be because the actual connection is only made when the first call comes through, and subsequent calls don't pay the connection setup penalty?


Member

ejona86 commented May 2, 2016

@makdharma, that seems very likely.

@LeeYeeze, the point is that you aren't really comparing "first RPC on a thread" against the rest; you're instead showing "gRPC is slow to start". Since each thread is started at the same time, I'd expect them each to have similar performance. To show that it is creating a new thread that is slow, you'd need to create new threads throughout the duration of the test.


LeeYeeze commented May 2, 2016

@makdharma @ejona86
I invoke a blockingUnaryCall before I start the threads, so the connection is already set up. I also tested with thread starts spread out over the duration of the test, and the result is almost the same, except that without contention the first call costs less time — but it is still slow. My laptop is not at hand right now, so I will paste my report here later.


ejona86 commented May 2, 2016

@LeeYeeze, I've reproduced what you see. I'm on Linux. Swapping my CPU governor from ondemand to performance didn't change behavior much.

I modified your code, hitting the hello-world-server:

import io.grpc.CallOptions;
import io.grpc.ManagedChannel;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;
import io.grpc.netty.NettyChannelBuilder;
import io.grpc.stub.ClientCalls;

import java.util.Date;
import java.util.concurrent.TimeUnit;

public class ThreadWarmUp {
  private static final HelloRequest request = HelloRequest.newBuilder().setName("world").build();

  public static class QueryThread extends Thread {
    private String name;
    private ManagedChannel channel;

    public QueryThread(String name, ManagedChannel channel) {
      this.name = name;
      this.channel = channel;
    }

    public void run() {
      final int type = 0;
      for (int i = 0 ; i < 3; i++) {
        long start = System.nanoTime();
        try {
          if (type == 0) {
            ClientCalls.blockingUnaryCall(
                channel, GreeterGrpc.METHOD_SAY_HELLO, CallOptions.DEFAULT, request);
          } else if (type == 1) {
            ClientCalls.futureUnaryCall(
                channel.newCall(GreeterGrpc.METHOD_SAY_HELLO, CallOptions.DEFAULT), request).get();
          }
          long duration = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
          String show  = "thread " + name + "\tcost " + duration + "µs";
          System.out.println(show);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ManagedChannel channel = NettyChannelBuilder.forAddress("127.0.0.1", 50051)
        .usePlaintext(true).build();
    GreeterGrpc.newBlockingStub(channel).sayHello(request);
    for (int i = 0; i < 100; i++) {
      QueryThread th = new QueryThread(String.valueOf(i), channel);
      th.start();
      Thread.sleep(100);
    }
  }
}

It results in things like:

thread 0    cost 7718µs
thread 0    cost 1816µs
thread 0    cost 1743µs
thread 1    cost 7718µs
thread 1    cost 1850µs
thread 1    cost 1852µs
thread 2    cost 7914µs
thread 2    cost 1879µs
thread 2    cost 1729µs
thread 3    cost 8246µs
thread 3    cost 2064µs
thread 3    cost 1875µs
...
thread 96   cost 1507µs
thread 96   cost 1127µs
thread 96   cost 1066µs
thread 97   cost 1815µs
thread 97   cost 1169µs
thread 97   cost 1038µs
thread 98   cost 1712µs
thread 98   cost 960µs
thread 98   cost 902µs
thread 99   cost 1703µs
thread 99   cost 1105µs
thread 99   cost 1234µs


LeeYeeze commented May 4, 2016

@ejona86
I've found that the main time consumer in this per-thread warm-up is the allocate method in NettyWritableBufferAllocator. Specifically, it is the call to allocator.buffer() inside that method that is responsible for the delay.

ejona86 added this to the Unscheduled milestone Jun 9, 2016
@tobegit3hub

We hit a similar problem when using TensorFlow Serving with Java gRPC clients. Is there any way to fix this? @ejona86


ferrari6666 commented Oct 4, 2019

Please let us know of any fix or workaround for this issue. @ejona86


zxiang179 commented Oct 16, 2020

I wonder if the latest grpc-java framework has fixed this problem?


gary-lo commented Feb 20, 2021

Any updates on this? @ejona86

Curious if this is similar to #2971


zzxgzgz commented Nov 3, 2021

Any update on this? @ejona86

I'm also experiencing a very similar issue. When using a gRPC channel and an async stub to send data, the first few onNext(data) calls take a lot of time (~15ms), but afterwards each takes much less (~2–3ms).

I'd like all of them to take as little time as possible. Do you have any suggestions?

I also tried to "warm up" the channel by sending a dummy message before the real work, but it doesn't seem to help. Am I doing it incorrectly?

Thank you


inkinworld commented Feb 24, 2022

I also try to "warmup" the channel by sending a dummy message before the real work, but it doesn't seem to help. Am I doing it incorrectly?

@zzxgzgz I do the same thing, and it works for me. Try constructing a dummy warm-up message that is large enough: since gRPC needs Netty to allocate buffer memory, the dummy message should be bigger than your real messages.


zzxgzgz commented Feb 24, 2022

@inkinworld thank you for sharing. Let me try that.

@josephykwang

Any reference implementation for constructing a large dummy warm-up message?


ejona86 commented Mar 29, 2022

Make a message with a bytes field, and then just stick a byte array of the size you want in it. The encoded protobuf will be a few bytes larger than that byte array.
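For a sense of the overhead: a protobuf bytes field is wire-encoded as a tag byte, a varint length, and the raw bytes, so the encoded message really is only a few bytes larger than the array. A small self-contained sketch of that arithmetic (assuming a field number low enough for a one-byte tag, e.g. `bytes payload = 1;` — the exact proto message name and field are up to you):

```java
public class WireSize {
    // Approximate encoded size of a protobuf message holding a single
    // "bytes" field (wire type 2): 1-byte tag + varint length + payload.
    static int encodedSize(int payloadLen) {
        // A varint uses one byte per 7 bits of the value.
        int varintLen = 1;
        for (long v = payloadLen; (v >>>= 7) != 0; ) {
            varintLen++;
        }
        return 1 + varintLen + payloadLen;
    }

    public static void main(String[] args) {
        // A 1 MiB dummy payload costs only 4 bytes of framing overhead.
        System.out.println(encodedSize(1 << 20)); // prints 1048580
    }
}
```

So for warm-up purposes, sizing the byte array itself to exceed your largest real message is sufficient; the framing is negligible.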

@313hemant313

Any reference implementation for constructing large dummy warmup message?

@adiharel30

Does anyone have any updates regarding this? My team still faces this issue and we haven't found any workaround...

@tommyulfsparre

Hey @ejona86, you might be able to shed some light on what we are observing (and whether it is expected).

We "pre-warm" all channels by sending a configurable number (10 in this case) of Health Checking RPCs, where the timings look something like this:

Preload finished after: 492 ms
Preload finished after: 84 ms
Preload finished after: 86 ms
Preload finished after: 80 ms
Preload finished after: 85 ms
Preload finished after: 88 ms
Preload finished after: 83 ms
Preload finished after: 82 ms
Preload finished after: 82 ms
Preload finished after: 81 ms

We assumed the channel should be sufficiently pre-warmed at that point. I even tried to abuse the service field to construct an arbitrarily large payload.

But we still take a latency hit on the very first RPC using the same channel but with a different RPC method:

First:  235 ms
Second: 145 ms


ejona86 commented Jan 28, 2025

@tommyulfsparre, the first time you call a method in a JVM will be slower, because of the JIT. This issue is about the Netty ByteBuf allocation performance on a new thread. If, on the same thread, you did a similarly-sized RPC before the RPC you care about and still see a performance difference, then it seems less likely to be this issue.

@tommyulfsparre

Thanks @ejona86, yes, I'm well aware of the effect of the JIT 😃 .

We do the warm-up from a single thread using the shared channel, but the RPCs we actually care about are not guaranteed to run on that thread, so the per-thread ByteBuf allocation cost seems plausible. Do you think there is a way to do any form of pre-allocation, or another way to achieve the same thing? We would happily trade some CPU/memory during startup to mitigate the initial latency hit.


ejona86 commented Jan 28, 2025

@tommyulfsparre, I still don't think you're seeing the same problem. You're seeing it take 90 ms longer. I reproduced ~600 µs in this issue. For a larger message that could be larger, but not 90 ms.

Client classloading, server classloading, heap size not stabilized, and server-side caches (including auth) are much more likely for that level of delay.

@tommyulfsparre

@ejona86 Agreed, although I was not as concerned about the absolute number (this was measuring a remote RPC) as about the observation that the initial RPC seems significantly slower even after pre-warming that should have covered JIT and other effects.

I did try to reproduce the observation here by running the client after server preloading. Although I can't explain why this RPC is slower than subsequent ones, the difference isn't really a concern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests