
server performance issues #3826

Closed
xiaoshuang-lu opened this issue Dec 4, 2017 · 30 comments

@xiaoshuang-lu
Contributor

xiaoshuang-lu commented Dec 4, 2017

Hi grpc-java guys, I have written some programs to performance-test grpc-java. The service is very simple: it just echoes the strings (about 400 bytes per request) received from clients. (Actually, I disabled the servers' logging.) It appears difficult for me to achieve high requests-per-second and low response time.

requests-per-second: about 220k (expected: 1000MB/400B)
response-time: p99 25ms, p999 33ms (expected: p99 below 4ms)

My environment and configurations are listed as follows.

Environment: 48 cores/128GB memory/10000Mbps net/CentOS Linux release 7.2.1511
gRPC Version: 1.8.0
JVM options: JDK 1.7.80 -Xms4g -Xmx8g
bossGroup: 4 threads
workerGroup: 160 threads
executor: ThreadPoolExecutor(64, 64, 1000 * 60 * 10, TimeUnit.MILLISECONDS, new LinkedBlockingQueue(65536))

threaddump.txt

According to the thread dump, many worker threads are stuck in SerializingExecutor.
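For reference, the executor line above corresponds roughly to this JDK-only construction (a sketch; the 64/64 sizing, 10-minute keep-alive, and 65536 queue capacity are the values from the report above):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorConfig {
    public static ThreadPoolExecutor build() {
        // 64 core threads, 64 max threads, 10-minute keep-alive,
        // and a bounded queue of up to 65536 pending tasks.
        return new ThreadPoolExecutor(
            64, 64,
            1000 * 60 * 10, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<Runnable>(65536));
    }

    public static void main(String[] args) {
        ThreadPoolExecutor executor = build();
        System.out.println(executor.getCorePoolSize());
        System.out.println(executor.getMaximumPoolSize());
        executor.shutdown();
    }
}
```

With this configuration all 64 threads pull tasks from a single shared LinkedBlockingQueue, which is a plausible contention point given the thread dump.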

@anhldbk

anhldbk commented Dec 4, 2017

@xiaoshuang-lu would you mind sharing the code to construct your gRPC server? The following is mine and it works well:

        Executor executor = Utils.getDefaultExecutor(threadNum);
        bossEventGroup = new EpollEventLoopGroup();
        workerEventGroup = new EpollEventLoopGroup();
        server = NettyServerBuilder.forPort(port)
                .addService(
                        ServerInterceptors.intercept(
                                new Service(backendFutureStub),
                                new AuthInterceptor()
                        )
                )
                .bossEventLoopGroup(bossEventGroup)
                .workerEventLoopGroup(workerEventGroup)
                .channelType(EpollServerSocketChannel.class)
                .executor(executor)
                .build();

@xiaoshuang-lu
Contributor Author

xiaoshuang-lu commented Dec 4, 2017

Server

        Class<? extends ServerChannel> channelClass;
        if (RuntimeUtils.isLinux() && this.config.getLeverageEPoll()) {
            this.bossGroup =
                new EpollEventLoopGroup(
                    this.config.getAcceptorThreadNumber(),
                    new AtomThreadFactory("ServerAcceptor"));
            this.workerGroup =
                new EpollEventLoopGroup(
                    this.config.getWorkerThreadNumber(),
                    new AtomThreadFactory("ServerWorker"));

            channelClass = EpollServerSocketChannel.class;
        } else {
            this.bossGroup =
                new NioEventLoopGroup(
                    this.config.getAcceptorThreadNumber(),
                    new AtomThreadFactory("ServerAcceptor"));
            this.workerGroup =
                new NioEventLoopGroup(
                    this.config.getWorkerThreadNumber(),
                    new AtomThreadFactory("ServerWorker"));

            channelClass = NioServerSocketChannel.class;
        }

        if (this.config.getBusinessLogicThreadType() == BusinessLogicThreadType.DEDICATED) {
            ThreadPoolExecutor threadPoolExecutor =
                new ThreadPoolExecutor(
                    this.config.getThreadPoolCoreSize(),
                    this.config.getThreadPoolMaxSize(),
                    this.config.getThreadPoolKeepAliveTime(),
                    TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>(this.config.getThreadPoolQueueSize()),
                    new AtomThreadFactory("BusinessLogicThread"));

            threadPoolExecutor.allowCoreThreadTimeOut(this.config.getThreadPoolAllowCoreThreadTimeout());

            threadPoolExecutor.prestartAllCoreThreads();

            if (this.config.getMetricEnabled() && this.metricRegistry != null) {
                MetricUtils.prepareThreadPoolExecutorMetricItems(this.metricRegistry, threadPoolExecutor);
            }

            this.executorService = threadPoolExecutor;
        }

        this.serverBuilder = NettyServerBuilder.forPort(this.config.getPort());

        NettyServerBuilder nettyServerBuilder = (NettyServerBuilder)(this.serverBuilder);

        nettyServerBuilder.bossEventLoopGroup(this.bossGroup);
        nettyServerBuilder.workerEventLoopGroup(this.workerGroup);
        nettyServerBuilder.channelType(channelClass);
        nettyServerBuilder.executor(this.executorService);

        nettyServerBuilder.flowControlWindow(this.config.getHttp2SettingsInitialWindowSize());
        nettyServerBuilder.maxConcurrentCallsPerConnection(this.config.getHttp2SettingsMaxConcurrentStreams());
        nettyServerBuilder.maxHeaderListSize(this.config.getHttp2SettingsMaxHeaderListSize());
        nettyServerBuilder.keepAliveTime(this.config.getPingInterval(), TimeUnit.MILLISECONDS);
        nettyServerBuilder.keepAliveTimeout(this.config.getPingTimeout(), TimeUnit.MILLISECONDS);
        if (this.config.getTlsEnabled()) {
            //nettyServerBuilder.sslContext(this.sslContext);
        }

        for (int i = 0; i < this.serviceDefinitionList.size(); ++ i) {
            nettyServerBuilder.addService(
                (BindableService)(this.serviceDefinitionList.get(i).getServiceImpl()));
        }
        MetricsServerTransportFilter metricsServerTransportFilter = new MetricsServerTransportFilter(this.metricRegistry);

        this.serverBuilder.addTransportFilter(metricsServerTransportFilter);

        MetricsServerStreamTracerFactory metricsServerStreamTracerFactory = new MetricsServerStreamTracerFactory();

        metricsServerStreamTracerFactory.setStopwatchSupplier(GrpcUtil.STOPWATCH_SUPPLIER);
        metricsServerStreamTracerFactory.setSlowCallTimeThreshold(this.config.getSlowCallTimeThreshold());
        metricsServerStreamTracerFactory.setMetricRegistry(this.metricRegistry);

        this.serverBuilder.addStreamTracerFactory(metricsServerStreamTracerFactory);

        for (ServerTransportFilter transportFilter : this.transportFilterList) {
            this.serverBuilder.addTransportFilter(transportFilter);
        }

        for (ServerInterceptor interceptor : this.interceptorList) {
            this.serverBuilder.intercept(interceptor);
        }

        for (ServerStreamTracer.Factory streamTracerFactory : this.streamTracerFactoryList) {
            this.serverBuilder.addStreamTracerFactory(streamTracerFactory);
        }

Service

    @Override
    public void test(
        GRPCPerformanceTestingService.PerformanceTestingServiceRequest request,
        StreamObserver<GRPCPerformanceTestingService.PerformanceTestingServiceResponse> streamObserver) {
        if (LOGGER.isDebugEnabled()) {
            LOGGER.debug(request.getData());
        }

        GRPCPerformanceTestingService.PerformanceTestingServiceResponse response =
            GRPCPerformanceTestingService.
            PerformanceTestingServiceResponse.
            newBuilder().
            setData(request.getData()).
            build();

        streamObserver.onNext(response);

        streamObserver.onCompleted();
    }

@carl-mastrangelo
Contributor

@xiaoshuang-lu Can you try using ForkJoinPool instead of ThreadPoolExecutor? It's what we use for the gRPC benchmarks because it behaves better under thread contention.
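The suggested swap might look like the following, a minimal JDK-only sketch (in the real server the pool would be handed to NettyServerBuilder.executor(...)). ForkJoinPool creates its worker threads on demand and uses per-worker work-stealing deques, so submissions contend less than 64 threads sharing one LinkedBlockingQueue:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ForkJoinDemo {
    public static void main(String[] args) throws InterruptedException {
        // Parallelism bounded by available cores; threads are created lazily.
        ForkJoinPool pool =
            new ForkJoinPool(Runtime.getRuntime().availableProcessors());

        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 1000; i++) {
            pool.execute(completed::incrementAndGet);
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(completed.get());  // 1000
    }
}
```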

@xiaoshuang-lu
Contributor Author

@carl-mastrangelo If ForkJoinPool is used, might we exhaust threads?

@carl-mastrangelo
Contributor

@xiaoshuang-lu I think they are created as needed. They don't go above the desired parallelism unless you use ManagedBlocker (which is rare).

@xiaoshuang-lu
Contributor Author

The thread dump indicates that most netty workers are in SerializingExecutor.execute().
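For context, gRPC's SerializingExecutor wraps another executor and drains queued tasks one at a time, so per-call callbacks never run concurrently. The following is a simplified illustration of that idea, not the actual grpc-java source:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

public class SimpleSerializingExecutor implements Executor {
    private final Executor delegate;
    private final ConcurrentLinkedQueue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean running = new AtomicBoolean();

    public SimpleSerializingExecutor(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Runnable task) {
        queue.add(task);
        // Schedule a drain only if one is not already running.
        if (running.compareAndSet(false, true)) {
            delegate.execute(this::drain);
        }
    }

    private void drain() {
        do {
            Runnable task;
            while ((task = queue.poll()) != null) {
                task.run();  // tasks run one at a time, in submission order
            }
            running.set(false);
            // Re-check: a task may have been enqueued after the last poll.
        } while (!queue.isEmpty() && running.compareAndSet(false, true));
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService delegate = Executors.newFixedThreadPool(4);
        SimpleSerializingExecutor serial = new SimpleSerializingExecutor(delegate);
        StringBuilder order = new StringBuilder();  // safe: appends never overlap
        CountDownLatch done = new CountDownLatch(10);
        for (int i = 0; i < 10; i++) {
            final int n = i;
            serial.execute(() -> { order.append(n).append(','); done.countDown(); });
        }
        done.await();
        System.out.println(order);  // tasks ran strictly in submission order
        delegate.shutdown();
    }
}
```

Because every per-stream task is funneled through one such queue, netty workers blocked in execute() point at contention on the queue or a slow downstream executor, not at the network threads themselves.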

@carl-mastrangelo
Contributor

@xiaoshuang-lu that means they are running your application code. What is the QPS you get?

@xiaoshuang-lu
Contributor Author

requests-per-second: about 220k (expected: 1000MB/400B)
response-time: p99 25ms, p999 33ms (expected: p99 below 4ms)

My environment and configurations are listed as follows.

Environment: 48 cores/128GB memory/10000Mbps net/CentOS Linux release 7.2.1511
gRPC Version: 1.8.0
JVM options: JDK 1.7.80 -Xms4g -Xmx8g
bossGroup: 4 threads
workerGroup: 160 threads
executor: ThreadPoolExecutor(64, 64, 1000 * 60 * 10, TimeUnit.MILLISECONDS, new LinkedBlockingQueue(65536))

@carl-mastrangelo
Contributor

From the output of that text, it looks like you didn't use ForkJoinPool.

@xiaoshuang-lu
Contributor Author

xiaoshuang-lu commented Dec 6, 2017

No, I didn't.

Yesterday I tried ForkJoinPool instead of ThreadPoolExecutor. Unfortunately, the JVM crashed. I will investigate this problem. The thread dump hints that the bottleneck is in SerializingExecutor.

@xiaoshuang-lu
Contributor Author

Hi @carl-mastrangelo, are there any performance testing reports that users can refer to?

@carl-mastrangelo
Contributor

We have a live dashboard of performance by language:
https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584

You can also look at our benchmarks/ directory to see the code that performs the benchmarks.

We don't have any tutorials yet on how to optimize.

@xiaoshuang-lu
Contributor Author

Thank you, @carl-mastrangelo. I will try the programs in benchmarks directory.

@xiaoshuang-lu
Contributor Author

xiaoshuang-lu commented Dec 7, 2017

QPS in "Unary secure throughput server CPU (32 core client to 32 core server)" on the live dashboard is about 850k. I ran the benchmarks (qps_server and qps_client) in my environment, but I just could not achieve the QPS shown on the live dashboard. Here are my commands. Did I do something wrong?

nohup sh bin/qps_server --address=10.110.60.40:9999 --transport=netty_nio > nohup.log 2>&1 &
nohup sh bin/qps_client --address=10.110.60.40:9999 --channels=10 --outstanding_rpcs=100 --client_payload=256 --server_payload=256 --duration=100000000 > nohup.log 2>&1 &

PS: I used 40 threads as netty server workers.

@carl-mastrangelo
Contributor

And what QPS did you get with that configuration? It would help if you could include information like load averages, strace output, or perf record data from the run. 40 Netty threads is a lot; you should be able to get by with far fewer.

Alternatively, if you could post your code it would reduce the number of roundtrips on this issue.

@xiaoshuang-lu
Contributor Author

xiaoshuang-lu commented Dec 8, 2017

I will upload my code package next week. Thanks.

@xiaoshuang-lu
Contributor Author

@carl-mastrangelo Here is my code package. :-)

grpc-benchmark.zip

@carl-mastrangelo
Contributor

carl-mastrangelo commented Dec 14, 2017

So as I mentioned above, you need to replace ThreadPoolExecutor with ForkJoinPool in both GRPCTestServiceServer and GRPCTestServiceClient. You can even use the same pool for both. Consider using ForkJoinPool.commonPool().

Second, you have too many netty threads. In normal, non-gRPC netty, the worker threads do most of the work. In gRPC, we move work off the network threads as soon as possible. The net effect is that you really only need ~1-2 workers in your event loop.
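For reference, ForkJoinPool.commonPool() is a single JVM-wide pool whose default parallelism tracks the available processors, so handing it to both server and client avoids creating a second large pool. A quick JDK-only sketch of its properties:

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolDemo {
    public static void main(String[] args) {
        ForkJoinPool pool = ForkJoinPool.commonPool();
        // The same shared instance everywhere in the JVM; no need to
        // size it by hand or shut it down.
        System.out.println(pool == ForkJoinPool.commonPool());
        System.out.println(pool.getParallelism() >= 1);
    }
}
```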

@xiaoshuang-lu
Contributor Author

It does not make any difference.

@carl-mastrangelo
Contributor

If you can give a reproducible example I can look into this further. I will need the exact command you used to run and the specs of the hardware you ran on.

@xiaoshuang-lu
Contributor Author

xiaoshuang-lu commented Jan 4, 2018

Hardware and Software Specs

Environment: 48 cores/128GB memory/10000Mbps net/CentOS Linux release 7.2.1511
gRPC Version: 1.8.0
JVM options: JDK 1.7.80 -Xms4g -Xmx8g

Server Command

nohup java -Xms4g -Xmx8g -Dcom.sun.management.jmxremote.host=10.1.1.1 -Dcom.sun.management.jmxremote.port=54321 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/grpc -cp grpc-benchmark-1.0.0-SNAPSHOT-jar-with-dependencies.jar:. root.GRPCTestServiceServer 9999 > nohup.log 2>&1 &

Client Command

nohup java -Xms4g -Xmx8g -Dcom.sun.management.jmxremote.host=10.1.1.2 -Dcom.sun.management.jmxremote.port=54321 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/grpc -cp grpc-benchmark-1.0.0-SNAPSHOT-jar-with-dependencies.jar:. root.GRPCTestServiceClient 10.1.1.1 9999 50 10 > nohup.log 2>&1 &

@carl-mastrangelo
Contributor

@xiaoshuang-lu

I ran the code you provided in grpc-benchmark. I made the following changes:

  • Used ForkJoinPool.commonPool() instead of ThreadPoolExecutor.
  • Reduced the NioEventLoopGroup size to 1.

Some things are kind of odd:

  • In the Client, the number of threads is wrong. You should create 1 Channel and give it an executor with the correct number of threads. As is, you are creating one Channel per "Thread", and each Channel in turn creates a thread pool. This results in threadCount^2 threads!
  • You are making an event loop group per channel. This is extremely inefficient. Netty works best when there is a lot of network activity, because it polls for work. If it doesn't find any work, it goes to sleep waiting for more work to arrive. Each time it goes to sleep and wakes, it wastes a lot of time. Since you are making a group per channel, you silo the work and ensure there is never more work to do.
  • The code as provided never finishes, and doesn't print out results. I can't tell how fast it is going because it doesn't tell me. The tracers look like they record info, but they don't seem to ever be read.

@xiaoshuang-lu
Contributor Author

@carl-mastrangelo Thank you very much. I appreciate your reply.

  • In the Client, the number of threads is wrong. You should create 1 Channel and give it an executor with the correct number of threads. As is, you are creating one Channel per "Thread", and each Channel in turn creates a thread pool. This results in threadCount^2 threads!

Each client has 4 threads: a thread pool with 2 threads and an event loop group with 2 threads. So the total thread count should be clientNumber * 4.

  • You are making an event loop group per channel. This is extremely inefficient. Netty works best when there is a lot of network activity, because it polls for work. If it doesn't find any work, it goes to sleep waiting for more work to arrive. Each time it goes to sleep and wakes, it wastes a lot of time. Since you are making a group per channel, you silo the work and ensure there is never more work to do.

I limited the event loop group thread count per client to 2. As mentioned above, this may not be a problem.

  • The code as provided never finishes, and doesn't print out results. I can't tell how fast it is going because it doesn't tell me. The tracers look like they record info, but they don't seem to ever be read.

I leveraged Dropwizard Metrics to collect statistics. The code can be found in GRPCTestServiceServer.java line 45 and GRPCTestServiceClient.java line 90. You can add reporters (e.g. GraphiteReporter) to the MetricRegistry.

Again, thanks Carl.

@ejona86
Member

ejona86 commented Mar 8, 2018

Looks like this is resolved. Closing.

@ejona86 ejona86 closed this as completed Mar 8, 2018
@xiaoshuang-lu
Contributor Author

I am afraid this issue is not resolved... :(

@ejona86
Member

ejona86 commented Apr 2, 2018

@carl-mastrangelo, do you know what needs to be done here?

@carl-mastrangelo
Contributor

Without more data about the actual run, I don't know what more can be done here. @xiaoshuang-lu, if you follow the instructions here: https://medium.com/netflix-techblog/java-in-flames-e763b3d32166, it can give you much more detailed data about why gRPC is slow.

@xiaoshuang-lu
Contributor Author

Hi, I think it should not be difficult to reproduce this issue if there is a Ganglia server or something similar available.

Add GangliaReporters to GRPCTestServiceServer.java line 46 and GRPCTestServiceClient.java line 91.

@zhangkun83
Contributor

@xiaoshuang-lu, did you have success on getting a repro?

@zhangkun83
Contributor

Closing it due to inactivity. @xiaoshuang-lu, feel free to reopen if you have new updates.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 14, 2019