
server performance issues #3826

Closed
xiaoshuang-lu opened this issue Dec 4, 2017 · 30 comments

@xiaoshuang-lu
Contributor

xiaoshuang-lu commented Dec 4, 2017

Hi grpc-java guys, I have written some programs to performance-test grpc-java. The service is very simple: it just echoes the strings (about 400 bytes per request) received from clients. (Actually, I disabled the servers' logging.) It appears difficult for me to achieve high requests-per-second and low response time.

requests-per-second: about 220k (expected: 1000MB/400B)
response-time: p99 25ms, p999 33ms (expected: p99 below 4ms)

My environment and configurations are listed as follows.

Environment: 48 cores/128GB memory/10000Mbps net/CentOS Linux release 7.2.1511
gRPC Version: 1.8.0
JVM options: JDK 1.7.80 -Xms4g -Xmx8g
bossGroup: 4 threads
workerGroup: 160 threads
executor: ThreadPoolExecutor(64, 64, 1000 * 60 * 10, TimeUnit.MILLISECONDS, new LinkedBlockingQueue(65536))

threaddump.txt

According to the thread dump, many worker threads are stuck in SerializingExecutor.
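For reference, the executor line above corresponds roughly to this JDK-only construction (a sketch; the 64/64 sizing, 10-minute keep-alive, and 65536 queue capacity are the values from the report above):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorConfig {
    public static ThreadPoolExecutor build() {
        // 64 core threads, 64 max threads, 10-minute keep-alive,
        // and a bounded queue of up to 65536 pending tasks.
        return new ThreadPoolExecutor(
            64, 64,
            1000 * 60 * 10, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<Runnable>(65536));
    }

    public static void main(String[] args) {
        ThreadPoolExecutor executor = build();
        System.out.println(executor.getCorePoolSize());
        System.out.println(executor.getMaximumPoolSize());
        executor.shutdown();
    }
}
```

With this configuration all 64 threads pull tasks from a single shared LinkedBlockingQueue, which is a plausible contention point given the thread dump.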

@anhldbk

anhldbk commented Dec 4, 2017

@xiaoshuang-lu would you mind sharing the code to construct your gRPC server? The following is mine and it works well:

        Executor executor = Utils.getDefaultExecutor(threadNum);
        bossEventGroup = new EpollEventLoopGroup();
        workerEventGroup = new EpollEventLoopGroup();
        server = NettyServerBuilder.forPort(port)
                .addService(
                        ServerInterceptors.intercept(
                                new Service(backendFutureStub),
                                new AuthInterceptor()
                        )
                )
                .bossEventLoopGroup(bossEventGroup)
                .workerEventLoopGroup(workerEventGroup)
                .channelType(EpollServerSocketChannel.class)
                .executor(executor)
                .build();

@xiaoshuang-lu
Contributor Author

xiaoshuang-lu commented Dec 4, 2017

Server

        Class<? extends ServerChannel> channelClass;
        if (RuntimeUtils.isLinux() && this.config.getLeverageEPoll()) {
            this.bossGroup =
                new EpollEventLoopGroup(
                    this.config.getAcceptorThreadNumber(),
                    new AtomThreadFactory("ServerAcceptor"));
            this.workerGroup =
                new EpollEventLoopGroup(
                    this.config.getWorkerThreadNumber(),
                    new AtomThreadFactory("ServerWorker"));

            channelClass = EpollServerSocketChannel.class;
        } else {
            this.bossGroup =
                new NioEventLoopGroup(
                    this.config.getAcceptorThreadNumber(),
                    new AtomThreadFactory("ServerAcceptor"));
            this.workerGroup =
                new NioEventLoopGroup(
                    this.config.getWorkerThreadNumber(),
                    new AtomThreadFactory("ServerWorker"));

            channelClass = NioServerSocketChannel.class;
        }

        if (this.config.getBusinessLogicThreadType() == BusinessLogicThreadType.DEDICATED) {
            ThreadPoolExecutor threadPoolExecutor =
                new ThreadPoolExecutor(
                    this.config.getThreadPoolCoreSize(),
                    this.config.getThreadPoolMaxSize(),
                    this.config.getThreadPoolKeepAliveTime(),
                    TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>(this.config.getThreadPoolQueueSize()),
                    new AtomThreadFactory("BusinessLogicThread"));

            threadPoolExecutor.allowCoreThreadTimeOut(this.config.getThreadPoolAllowCoreThreadTimeout());

            threadPoolExecutor.prestartAllCoreThreads();

            if (this.config.getMetricEnabled() && this.metricRegistry != null) {
                MetricUtils.prepareThreadPoolExecutorMetricItems(this.metricRegistry, threadPoolExecutor);
            }

            this.executorService = threadPoolExecutor;
        }

        this.serverBuilder = NettyServerBuilder.forPort(this.config.getPort());

        NettyServerBuilder nettyServerBuilder = (NettyServerBuilder)(this.serverBuilder);

        nettyServerBuilder.bossEventLoopGroup(this.bossGroup);
        nettyServerBuilder.workerEventLoopGroup(this.workerGroup);
        nettyServerBuilder.channelType(channelClass);
        nettyServerBuilder.executor(this.executorService);

        nettyServerBuilder.flowControlWindow(this.config.getHttp2SettingsInitialWindowSize());
        nettyServerBuilder.maxConcurrentCallsPerConnection(this.config.getHttp2SettingsMaxConcurrentStreams());
        nettyServerBuilder.maxHeaderListSize(this.config.getHttp2SettingsMaxHeaderListSize());
        nettyServerBuilder.keepAliveTime(this.config.getPingInterval(), TimeUnit.MILLISECONDS);
        nettyServerBuilder.keepAliveTimeout(this.config.getPingTimeout(), TimeUnit.MILLISECONDS);
        if (this.config.getTlsEnabled()) {
            //nettyServerBuilder.sslContext(this.sslContext);
        }

        for (int i = 0; i < this.serviceDefinitionList.size(); ++ i) {
            nettyServerBuilder.addService(
                (BindableService)(this.serviceDefinitionList.get(i).getServiceImpl()));
        }
        MetricsServerTransportFilter metricsServerTransportFilter = new MetricsServerTransportFilter(this.metricRegistry);

        this.serverBuilder.addTransportFilter(metricsServerTransportFilter);

        MetricsServerStreamTracerFactory metricsServerStreamTracerFactory = new MetricsServerStreamTracerFactory();

        metricsServerStreamTracerFactory.setStopwatchSupplier(GrpcUtil.STOPWATCH_SUPPLIER);
        metricsServerStreamTracerFactory.setSlowCallTimeThreshold(this.config.getSlowCallTimeThreshold());
        metricsServerStreamTracerFactory.setMetricRegistry(this.metricRegistry);

        this.serverBuilder.addStreamTracerFactory(metricsServerStreamTracerFactory);

        for (ServerTransportFilter transportFilter : this.transportFilterList) {
            this.serverBuilder.addTransportFilter(transportFilter);
        }

        for (ServerInterceptor interceptor : this.interceptorList) {
            this.serverBuilder.intercept(interceptor);
        }

        for (ServerStreamTracer.Factory streamTracerFactory : this.streamTracerFactoryList) {
            this.serverBuilder.addStreamTracerFactory(streamTracerFactory);
        }

Service

    @Override
    public void test(
        GRPCPerformanceTestingService.PerformanceTestingServiceRequest request,
        StreamObserver<GRPCPerformanceTestingService.PerformanceTestingServiceResponse> streamObserver) {
        if (LOGGER.isDebugEnabled()) {
            LOGGER.debug(request.getData());
        }

        GRPCPerformanceTestingService.PerformanceTestingServiceResponse response =
            GRPCPerformanceTestingService.
            PerformanceTestingServiceResponse.
            newBuilder().
            setData(request.getData()).
            build();

        streamObserver.onNext(response);

        streamObserver.onCompleted();
    }

@carl-mastrangelo
Contributor

@xiaoshuang-lu Can you try using ForkJoinPool instead of ThreadPoolExecutor? It's what we use for the gRPC benchmarks because it behaves better under thread contention.
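The suggested swap might look like the following, a minimal JDK-only sketch (in the real server the pool would be handed to NettyServerBuilder.executor(...)). ForkJoinPool creates its worker threads on demand and uses per-worker work-stealing deques, so submissions contend less than 64 threads sharing one LinkedBlockingQueue:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ForkJoinDemo {
    public static void main(String[] args) throws InterruptedException {
        // Parallelism bounded by available cores; threads are created lazily.
        ForkJoinPool pool =
            new ForkJoinPool(Runtime.getRuntime().availableProcessors());

        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 1000; i++) {
            pool.execute(completed::incrementAndGet);
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(completed.get());  // 1000
    }
}
```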

@xiaoshuang-lu
Contributor Author

@carl-mastrangelo If ForkJoinPool is used, might we exhaust threads?

@carl-mastrangelo
Contributor

@xiaoshuang-lu I think they are created as needed. They don't go above the desired parallelism unless you use ManagedBlocker (which is rare).

@xiaoshuang-lu
Contributor Author

The thread dump indicates that most netty workers are in SerializingExecutor.execute().
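For context, gRPC's SerializingExecutor wraps another executor and drains queued tasks one at a time, so per-call callbacks never run concurrently. The following is a simplified illustration of that idea, not the actual grpc-java source:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

public class SimpleSerializingExecutor implements Executor {
    private final Executor delegate;
    private final ConcurrentLinkedQueue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean running = new AtomicBoolean();

    public SimpleSerializingExecutor(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Runnable task) {
        queue.add(task);
        // Schedule a drain only if one is not already running.
        if (running.compareAndSet(false, true)) {
            delegate.execute(this::drain);
        }
    }

    private void drain() {
        do {
            Runnable task;
            while ((task = queue.poll()) != null) {
                task.run();  // tasks run one at a time, in submission order
            }
            running.set(false);
            // Re-check: a task may have been enqueued after the last poll.
        } while (!queue.isEmpty() && running.compareAndSet(false, true));
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService delegate = Executors.newFixedThreadPool(4);
        SimpleSerializingExecutor serial = new SimpleSerializingExecutor(delegate);
        StringBuilder order = new StringBuilder();  // safe: appends never overlap
        CountDownLatch done = new CountDownLatch(10);
        for (int i = 0; i < 10; i++) {
            final int n = i;
            serial.execute(() -> { order.append(n).append(','); done.countDown(); });
        }
        done.await();
        System.out.println(order);  // tasks ran strictly in submission order
        delegate.shutdown();
    }
}
```

Because every per-stream task is funneled through one such queue, netty workers blocked in execute() point at contention on the queue or a slow downstream executor, not at the network threads themselves.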

@carl-mastrangelo
Contributor

@xiaoshuang-lu that means they are running your application code. What is the QPS you get?

@xiaoshuang-lu
Contributor Author

requests-per-second: about 220k (expected: 1000MB/400B)
response-time: p99 25ms, p999 33ms (expected: p99 below 4ms)

My environment and configurations are listed as follows.

Environment: 48 cores/128GB memory/10000Mbps net/CentOS Linux release 7.2.1511
gRPC Version: 1.8.0
JVM options: JDK 1.7.80 -Xms4g -Xmx8g
bossGroup: 4 threads
workerGroup: 160 threads
executor: ThreadPoolExecutor(64, 64, 1000 * 60 * 10, TimeUnit.MILLISECONDS, new LinkedBlockingQueue(65536))

@carl-mastrangelo
Contributor

From the output of that text, it looks like you didn't use ForkJoinPool.

@xiaoshuang-lu
Contributor Author

xiaoshuang-lu commented Dec 6, 2017

No, I didn't.

Yesterday I tried ForkJoinPool instead of ThreadPoolExecutor. Unfortunately, the JVM crashed. I will investigate this problem. The thread dump hints that the bottleneck is in SerializingExecutor.

@xiaoshuang-lu
Contributor Author

Hi @carl-mastrangelo, are there any performance testing reports that users can refer to?

@carl-mastrangelo
Contributor

We have a live dashboard of performance by language:
https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584

You can also look at our benchmarks/ directory to see the code that performs the benchmarks.

We don't have any tutorials yet on how to optimize.

@xiaoshuang-lu
Contributor Author

Thank you, @carl-mastrangelo. I will try the programs in benchmarks directory.

@xiaoshuang-lu
Contributor Author

xiaoshuang-lu commented Dec 7, 2017

QPS in "Unary secure throughput server CPU (32 core client to 32 core server)" on the live dashboard is about 850k. I ran the benchmarks (qps_server and qps_client) in my environment, but I just could not achieve the QPS shown on the live dashboard. Here are my commands. Did I do something wrong?

nohup sh bin/qps_server --address=10.110.60.40:9999 --transport=netty_nio > nohup.log 2>&1 &
nohup sh bin/qps_client --address=10.110.60.40:9999 --channels=10 --outstanding_rpcs=100 --client_payload=256 --server_payload=256 --duration=100000000 > nohup.log 2>&1 &

PS: I used 40 threads as netty server workers.

@carl-mastrangelo
Contributor

And what QPS did you get with that configuration? It would help if you could include information like load averages, strace output, or perf record data from the run. 40 Netty threads is a lot; you should be able to get by with far fewer.

Alternatively, if you could post your code it would reduce the number of roundtrips on this issue.

@xiaoshuang-lu
Contributor Author

xiaoshuang-lu commented Dec 8, 2017

I will upload my code package next week. Thanks.

@xiaoshuang-lu
Contributor Author

@carl-mastrangelo Here is my code package. :-)

grpc-benchmark.zip

@carl-mastrangelo
Contributor

carl-mastrangelo commented Dec 14, 2017

So as I mentioned above, you need to replace ThreadPoolExecutor with ForkJoinPool in both GRPCTestServiceServer and GRPCTestServiceClient. You can even use the same pool for both. Consider using ForkJoinPool.commonPool().

Second, you have too many netty threads. In normal, non-gRPC netty, the worker threads do most of the work. In gRPC, we move work off the network threads as soon as possible. The net effect is that you really only need ~1-2 workers in your event loop.
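For reference, ForkJoinPool.commonPool() is a single JVM-wide pool whose default parallelism tracks the available processors, so handing it to both server and client avoids creating a second large pool. A quick JDK-only sketch of its properties:

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolDemo {
    public static void main(String[] args) {
        ForkJoinPool pool = ForkJoinPool.commonPool();
        // The same shared instance everywhere in the JVM; no need to
        // size it by hand or shut it down.
        System.out.println(pool == ForkJoinPool.commonPool());
        System.out.println(pool.getParallelism() >= 1);
    }
}
```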

@xiaoshuang-lu
Contributor Author

It does not make any difference.

@carl-mastrangelo
Contributor

If you can give a reproducible example I can look into this further. I will need the exact command you used to run and the specs of the hardware you ran on.

@xiaoshuang-lu
Contributor Author

xiaoshuang-lu commented Jan 4, 2018

Hardware and Software Specs

Environment: 48 cores/128GB memory/10000Mbps net/CentOS Linux release 7.2.1511
gRPC Version: 1.8.0
JVM options: JDK 1.7.80 -Xms4g -Xmx8g

Server Command

nohup java -Xms4g -Xmx8g -Dcom.sun.management.jmxremote.host=10.1.1.1 -Dcom.sun.management.jmxremote.port=54321 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/grpc -cp grpc-benchmark-1.0.0-SNAPSHOT-jar-with-dependencies.jar:. root.GRPCTestServiceServer 9999 > nohup.log 2>&1 &

Client Command

nohup java -Xms4g -Xmx8g -Dcom.sun.management.jmxremote.host=10.1.1.2 -Dcom.sun.management.jmxremote.port=54321 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/grpc -cp grpc-benchmark-1.0.0-SNAPSHOT-jar-with-dependencies.jar:. root.GRPCTestServiceClient 10.1.1.1 9999 50 10 > nohup.log 2>&1 &

@carl-mastrangelo
Contributor

@xiaoshuang-lu

I ran the code you provided in grpc-benchmark. I made the following changes:

  • Used ForkJoinPool.commonPool() instead of ThreadPoolExecutor.
  • Reduced the NioEventLoopGroup size to 1.

Some things are kind of odd:

  • In the Client, the number of threads is wrong. You should create 1 Channel and give it an executor with the correct number of threads. As is, you are creating one Channel per "Thread", and each Channel in turn creates a thread pool. This results in threadCount^2 threads!
  • You are making an event loop group per channel. This is extremely inefficient. Netty works best when there is a lot of network activity, because it polls for work. If it doesn't find any work, it goes to sleep waiting for more work to arrive. Each time it goes to sleep and wakes, it wastes a lot of time. Since you are making a group per channel, you silo the work and ensure there is never more work to do.
  • The code as provided never finishes, and doesn't print out results. I can't tell how fast it is going because it doesn't tell me. The tracers look like they record info, but they don't seem to ever be read.

@xiaoshuang-lu
Contributor Author

@carl-mastrangelo Thank you very much. I appreciate your reply.

  • In the Client, the number of threads is wrong. You should create 1 Channel and give it an executor with the correct number of threads. As is, you are creating one Channel per "Thread", and each Channel in turn creates a thread pool. This results in threadCount^2 threads!

Each client has 4 threads: a thread pool with 2 threads and an event loop group with 2 threads. So the total thread count should be clientNumber * 4.

  • You are making an event loop group per channel. This is extremely inefficient. Netty works best when there is a lot of network activity, because it polls for work. If it doesn't find any work, it goes to sleep waiting for more work to arrive. Each time it goes to sleep and wakes, it wastes a lot of time. Since you are making a group per channel, you silo the work and ensure there is never more work to do.

I limited the event loop group thread count per client to 2. As mentioned above, this may not be a problem.

  • The code as provided never finishes, and doesn't print out results. I can't tell how fast it is going because it doesn't tell me. The tracers look like they record info, but they don't seem to ever be read.

I leveraged Dropwizard Metrics to collect statistics. The code can be found in GRPCTestServiceServer.java line 45 and GRPCTestServiceClient.java line 90. You can add reporters (e.g. GraphiteReporter) to the MetricRegistry.

Again, thanks Carl.

@ejona86
Member

ejona86 commented Mar 8, 2018

Looks like this is resolved. Closing.

@ejona86 ejona86 closed this as completed Mar 8, 2018
@xiaoshuang-lu
Contributor Author

I am afraid this issue is not resolved... :(

@ejona86
Member

ejona86 commented Apr 2, 2018

@carl-mastrangelo, do you know what needs to be done here?

@carl-mastrangelo
Contributor

Without more data about the actual run, I don't know what more can be done here. @xiaoshuang-lu, if you follow the instructions here: https://medium.com/netflix-techblog/java-in-flames-e763b3d32166, it can give you much more detailed data about why gRPC is slow.

@xiaoshuang-lu
Contributor Author

Hi, I think it should not be difficult to reproduce this issue if there is a Ganglia server or something similar available.

Add GangliaReporters to GRPCTestServiceServer.java line 46 and GRPCTestServiceClient.java line 91.

@zhangkun83
Contributor

@xiaoshuang-lu, did you have success on getting a repro?

@zhangkun83
Contributor

Closing it due to inactivity. @xiaoshuang-lu, feel free to reopen if you have new updates.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 14, 2019