
channel warm up per thread #1758

Open
LeeYeeze opened this issue May 1, 2016 · 23 comments
LeeYeeze commented May 1, 2016

Hi, all. I'm doing an experiment with gRPC and found that the first use of gRPC in a thread costs much more time than subsequent identical calls. I tested on a 2-core laptop and on 24-core machines, and the same phenomenon occurred on all of them. So I'm wondering whether I'm using gRPC wrong or whether something in grpc-java could be improved to avoid this. Below is my code and part of the report.

import java.text.SimpleDateFormat;
import java.util.Date;

import me.gdbc.mocker.MockServer;
import me.proto.query.QueryProtocol.QueryRequest;
import me.proto.query.QueryProtocol.QueryResponse;

import static me.proto.query.QueryServiceInterfaceGrpc.METHOD_QUERY;

import io.grpc.CallOptions;
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

public class WarmUp {

    public static SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    public static QueryRequest request = QueryRequest.newBuilder().setQueryString("").build();

    public static class QueryThread extends Thread {

        private String name;
        private ManagedChannel client;

        public QueryThread(String name, ManagedChannel client) {
            this.name = name;
            this.client = client;
        }

        public void run() {
            for (int i = 0; i < 300; i++) {
                long start = System.currentTimeMillis();
                try {
                    //ClientCall<QueryRequest, QueryResponse> call = client.newCall(METHOD_QUERY, CallOptions.DEFAULT);
                    QueryResponse ret = io.grpc.stub.ClientCalls.blockingUnaryCall(client, METHOD_QUERY, CallOptions.DEFAULT, request);
                    //QueryResponse ret = io.grpc.stub.ClientCalls.blockingUnaryCall(call, request);
                    long duration = System.currentTimeMillis() - start;
                    String show = "" + ret + "\n" + df.format(new Date()) + "\n" + "cost " + duration + "ms in thread " + name;
                    System.out.println(show);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        int port = 9012;
        MockServer server = new MockServer(port);
        try {
            server.start();
            ManagedChannel channel = NettyChannelBuilder.forAddress("127.0.0.1", port).usePlaintext(true).build();
            io.grpc.stub.ClientCalls.blockingUnaryCall(channel, METHOD_QUERY, CallOptions.DEFAULT, request);
            for (int i = 0; i < 4; i++) {
                QueryThread th = new QueryThread(String.valueOf(i), channel);
                th.start();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
May 01, 2016 2:28:09 PM me.gdbc.mocker.MockServer start
INFO: Server started, listenning on 9012
May 01, 2016 2:28:09 PM io.grpc.internal.ManagedChannelImpl <init>
INFO: [ManagedChannelImpl@326de728] Created with target 127.0.0.1:9012
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 76ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 67ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 68ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 66ms in thread 2
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 8ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 8ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 10ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 10ms in thread 2
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 8ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 10ms in thread 2
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 4ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 7ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 3
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 1
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 0
result {
  props {
    intVal: 1
  }
}

2016-05-01 14:28:10
cost 6ms in thread 3
result {
  props {
    intVal: 1
  }
}
@carl-mastrangelo
Contributor

A few things:

  • You should measure this kind of time with System.nanoTime().
  • The JVM takes a large number of iterations before it JIT-compiles code — as many as 10,000 — which it doesn't look like your test hits.
  • Please run the benchmark code (in benchmarks/...) to get an idea of what a reasonable latency is.
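The first two points above can be sketched as a minimal, self-contained timing harness. This is plain Java with no gRPC dependency; the `Runnable` is a stand-in for the actual RPC, and the warm-up/measured iteration counts are illustrative, not tuned values:

```java
import java.util.Arrays;
import java.util.concurrent.locks.LockSupport;

public class Timing {
    // Measure with System.nanoTime() and discard warm-up iterations so that
    // JIT compilation and other first-use costs do not dominate the numbers.
    static long medianMicros(Runnable task, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long[] samples = new long[measured];
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = (System.nanoTime() - start) / 1_000; // µs
        }
        // Report the median rather than the mean: it is robust to the
        // occasional slow outlier (GC pause, scheduler hiccup).
        Arrays.sort(samples);
        return samples[measured / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload: park ~1 ms, roughly as a local RPC might.
        long m = medianMicros(() -> LockSupport.parkNanos(1_000_000L), 10, 30);
        System.out.println("median ~" + m + "µs");
    }
}
```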


LeeYeeze commented May 2, 2016

Hi Carl. This isn't a benchmark, and I'm not concerned with absolute latency right now. What bothers me is the huge difference between the first gRPC call in a thread and the following ones in the same thread. The first usually takes about 100ms, while subsequent calls take only about 5ms to do the same thing (the difference is clear even without measuring in nanoseconds). And it doesn't seem to be a JIT problem, since the test never reaches the compilation threshold.

@makdharma

Could it be because the actual connection is only made when the first call comes through, and subsequent calls don't pay the connection setup penalty?


Member

ejona86 commented May 2, 2016

@makdharma, that seems very likely.

@LeeYeeze, the point is that you aren't really comparing "first RPC on a thread" against the rest; you're instead showing "gRPC is slow to start". Since each thread is started at the same time, I'd expect them each to have similar performance. To show that it is creating a new thread that is slow, you'd need to create new threads throughout the duration of the test.


LeeYeeze commented May 2, 2016

@makdharma @ejona86
I invoke a blockingUnaryCall before I start the threads, so the connection is already set up. I also tested with thread starts spread out over the duration of the test, and the result is almost the same, except that without contention the first call costs less time — but it is still slow. My laptop is not at hand right now, so I will paste my report here later.


ejona86 commented May 2, 2016

@LeeYeeze, I've reproduced what you see. I'm on Linux. Swapping my CPU governor from ondemand to performance didn't change behavior much.

I modified your code, hitting the hello-world-server:

import io.grpc.CallOptions;
import io.grpc.ManagedChannel;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;
import io.grpc.netty.NettyChannelBuilder;
import io.grpc.stub.ClientCalls;

import java.util.Date;
import java.util.concurrent.TimeUnit;

public class ThreadWarmUp {
  private static final HelloRequest request = HelloRequest.newBuilder().setName("world").build();

  public static class QueryThread extends Thread {
    private String name;
    private ManagedChannel channel;

    public QueryThread(String name, ManagedChannel channel) {
      this.name = name;
      this.channel = channel;
    }

    public void run() {
      final int type = 0;
      for (int i = 0 ; i < 3; i++) {
        long start = System.nanoTime();
        try {
          if (type == 0) {
            ClientCalls.blockingUnaryCall(
                channel, GreeterGrpc.METHOD_SAY_HELLO, CallOptions.DEFAULT, request);
          } else if (type == 1) {
            ClientCalls.futureUnaryCall(
                channel.newCall(GreeterGrpc.METHOD_SAY_HELLO, CallOptions.DEFAULT), request).get();
          }
          long duration = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
          String show  = "thread " + name + "\tcost " + duration + "µs";
          System.out.println(show);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ManagedChannel channel = NettyChannelBuilder.forAddress("127.0.0.1", 50051)
        .usePlaintext(true).build();
    GreeterGrpc.newBlockingStub(channel).sayHello(request);
    for (int i = 0; i < 100; i++) {
      QueryThread th = new QueryThread(String.valueOf(i), channel);
      th.start();
      Thread.sleep(100);
    }
  }
}

It results in things like:

thread 0    cost 7718µs
thread 0    cost 1816µs
thread 0    cost 1743µs
thread 1    cost 7718µs
thread 1    cost 1850µs
thread 1    cost 1852µs
thread 2    cost 7914µs
thread 2    cost 1879µs
thread 2    cost 1729µs
thread 3    cost 8246µs
thread 3    cost 2064µs
thread 3    cost 1875µs
...
thread 96   cost 1507µs
thread 96   cost 1127µs
thread 96   cost 1066µs
thread 97   cost 1815µs
thread 97   cost 1169µs
thread 97   cost 1038µs
thread 98   cost 1712µs
thread 98   cost 960µs
thread 98   cost 902µs
thread 99   cost 1703µs
thread 99   cost 1105µs
thread 99   cost 1234µs


LeeYeeze commented May 4, 2016

@ejona86
I've found that the main time consumer in this per-thread warm-up is the allocate method in NettyWritableBufferAllocator. Specifically, it is the call to allocator.buffer() inside that method that is responsible for the delay.

ejona86 added this to the Unscheduled milestone Jun 9, 2016
@tobegit3hub

We hit a similar problem when using TensorFlow Serving with Java gRPC clients. Is there any way to fix this? @ejona86


ferrari6666 commented Oct 4, 2019

Please let us know of any fix or workaround for this issue. @ejona86


zxiang179 commented Oct 16, 2020

I wonder if the latest grpc-java framework has fixed this problem?


gary-lo commented Feb 20, 2021

Any updates on this? @ejona86

Curious if this is similar to #2971


zzxgzgz commented Nov 3, 2021

Any update on this? @ejona86

I'm also experiencing a very similar issue. When using a gRPC channel and an async stub to send data, the first few onNext(data) calls take a lot of time (~15ms), but afterwards each takes much less (~2–3ms).

I'd like all of them to take as little time as possible. Do you have any suggestions?

I also tried to "warm up" the channel by sending a dummy message before the real work, but it doesn't seem to help. Am I doing it incorrectly?

Thank you


inkinworld commented Feb 24, 2022

I also try to "warmup" the channel by sending a dummy message before the real work, but it doesn't seem to help. Am I doing it incorrectly?

@zzxgzgz I do the same thing, and it works for me. Try constructing a dummy warm-up message that is large enough: since gRPC needs Netty to allocate buffer memory, the dummy message should be bigger than your real messages.


zzxgzgz commented Feb 24, 2022

@inkinworld thank you for sharing. Let me try that.

@josephykwang

Any reference implementation for constructing a large dummy warm-up message?


ejona86 commented Mar 29, 2022

Make a message with a bytes field, and then just stick a byte array of the size you want in it. The encoded protobuf will be a few bytes larger than that byte array.
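For a sense of the overhead: a protobuf bytes field is wire-encoded as a tag byte, a varint length, and the raw bytes, so the encoded message really is only a few bytes larger than the array. A small self-contained sketch of that arithmetic (assuming a field number low enough for a one-byte tag, e.g. `bytes payload = 1;` — the exact proto message name and field are up to you):

```java
public class WireSize {
    // Approximate encoded size of a protobuf message holding a single
    // "bytes" field (wire type 2): 1-byte tag + varint length + payload.
    static int encodedSize(int payloadLen) {
        // A varint uses one byte per 7 bits of the value.
        int varintLen = 1;
        for (long v = payloadLen; (v >>>= 7) != 0; ) {
            varintLen++;
        }
        return 1 + varintLen + payloadLen;
    }

    public static void main(String[] args) {
        // A 1 MiB dummy payload costs only 4 bytes of framing overhead.
        System.out.println(encodedSize(1 << 20)); // prints 1048580
    }
}
```

So for warm-up purposes, sizing the byte array itself to exceed your largest real message is sufficient; the framing is negligible.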

@313hemant313

Any reference implementation for constructing large dummy warmup message?

@adiharel30

Does anyone have any updates regarding this? My team still faces this issue and we haven't found any workaround...

@tommyulfsparre

Hey @ejona86, you might be able to shed some light on what we are observing (and whether it is expected).

We "pre-warm" all channels by sending a configurable number (10 in this case) of Health Checking RPCs, where the timings look something like this:

Preload finished after: 492 ms
Preload finished after: 84 ms
Preload finished after: 86 ms
Preload finished after: 80 ms
Preload finished after: 85 ms
Preload finished after: 88 ms
Preload finished after: 83 ms
Preload finished after: 82 ms
Preload finished after: 82 ms
Preload finished after: 81 ms

We assumed the channel should be sufficiently pre-warmed at that point. I even tried to abuse the service field to construct an arbitrarily large payload.

But we still take a latency hit on the very first RPC using the same channel but with a different RPC method:

First:  235 ms
Second: 145 ms


ejona86 commented Jan 28, 2025

@tommyulfsparre, the first time you call a method in a JVM will be slower, because of the JIT. This issue is about the Netty ByteBuf allocation performance on a new thread. If, on the same thread, you did a similarly-sized RPC before the RPC you care about and still see a performance difference, then it seems less likely to be this issue.

@tommyulfsparre

Thanks @ejona86, yes, I'm well aware of the effect of the JIT 😃 .

We do the warm-up from a single thread using the shared channel, but the RPCs we actually care about are not guaranteed to run on that thread, so the per-thread ByteBuf allocation cost seems plausible. Do you think there is a way to do any form of pre-allocation, or another way to achieve the same thing? We would happily trade some CPU/memory during startup to mitigate the initial latency hit.


ejona86 commented Jan 28, 2025

@tommyulfsparre, I still don't think you're seeing the same problem. You're seeing it take 90 ms longer. I reproduced ~600 µs in this issue. For a larger message that could be larger, but not 90 ms.

Client classloading, server classloading, heap size not stabilized, and server-side caches (including auth) are much more likely for that level of delay.

@tommyulfsparre

@ejona86 Agreed, although I was not as concerned about the absolute number (this was measuring a remote RPC) as about the observation that the initial RPC seems significantly slower even after pre-warming that should have covered JIT and other effects.

I did try to reproduce the observation here by running the client after server preloading. Although I can't explain why this RPC is slower than subsequent ones, the difference isn't really a concern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests