Communication from c++ server to python client is too slow #22260

Open
CoinCheung opened this issue Mar 7, 2020 · 23 comments
Comments

@CoinCheung

I am not sure whether this is a bug or a problem with my configuration. My code looks like this:

//server.cpp
#include <iostream>
#include <string>
#include <vector>
#include <memory>
#include <chrono>
#include <cstring>  // memset

#include <grpcpp/grpcpp.h>

#include "comm/interface.grpc.pb.h"
#include "comm/interface.pb.h"


using std::string;
using std::vector;
using std::unique_ptr;
using std::cout;
using std::endl;

using grpc::Status;
using grpc::ServerBuilder;
using grpc::ServerContext;
using grpc::ServerWriter;

class ImageServiceImpl final: public comm::ImageService::Service {
    public:
        ImageServiceImpl();
        Status get_batch(ServerContext *context,  const comm::BatchRequest *request, comm::BatchReply* reply) override;
};

ImageServiceImpl::ImageServiceImpl() {}
Status ImageServiceImpl::get_batch(ServerContext *context,
        const comm::BatchRequest *request,
        comm::BatchReply* reply) {

    int64_t buf_len = 1024 * 224 * 224 * 3 * sizeof(float);  // payload size in bytes (~588 MiB)
    vector<float> imbuf(buf_len / sizeof(float));  // scratch buffer; buf_len counts bytes, not floats
    auto start = std::chrono::steady_clock::now();

    for (int i{0}; i < 1024; ++i) reply->add_labels(1);
    reply->add_shape(1024);
    reply->add_shape(224);
    reply->add_shape(224);
    reply->add_shape(3);
    reply->set_dtype("float32");
    string *sptr = reply->mutable_data();
    sptr->resize(buf_len);
    memset(&(sptr->at(0)), 1, buf_len);

    auto end = std::chrono::steady_clock::now();
    cout << "time is: " << std::chrono::duration<double, std::milli>(end - start).count() << endl;
    return Status::OK;
}

int main() {
    string addr("0.0.0.0:50004");
    ImageServiceImpl service;
    ServerBuilder builder;
    builder.AddListeningPort(addr, grpc::InsecureServerCredentials());
    builder.RegisterService(&service);
    builder.SetMaxSendMessageSize(1 << 30);     // the setter takes an int, so 1L << 31 would overflow
    builder.SetMaxReceiveMessageSize(1 << 30);
    std::unique_ptr<grpc::Server> server(builder.BuildAndStart());
    std::cout << "service started, listening to: " << addr << std::endl;

    server->Wait();
    return 0;
}

And the client is like this:

import logging
import time
import sys
sys.path.insert(0, './commpy')

import grpc
import numpy as np

from commpy.interface_pb2 import BatchRequest
from commpy.interface_pb2_grpc import ImageServiceStub

def run_get_batch():
    with grpc.insecure_channel(
            'localhost:50004',
            options=[
                ('grpc.max_receive_message_length', 1 << 30),
                ('grpc.max_send_message_length', 1 << 30),
            ]
        ) as channel:
        stub = ImageServiceStub(channel)
        req = BatchRequest()
        reply = stub.get_batch(req)

    ims = np.frombuffer(
            reply.data, dtype=reply.dtype).reshape(reply.shape)
    print(ims[:4, :1, :1, 0].ravel())

if __name__ == "__main__":
    logging.basicConfig()
    for _ in range(1400):
        t1 = time.time()
        run_get_batch()
        t2 = time.time()
        print('time is: {:.4f}'.format(t2 - t1))

My protobuf is like this:

syntax = "proto3";

package comm;

service ImageService {
    rpc get_batch (BatchRequest) returns (BatchReply) {}
}
message BatchRequest {
}
message BatchReply {
    bytes data = 1;
    repeated int32 shape = 2;
    string dtype = 3;
    repeated int64 labels = 4;
}

The problem is that when I run this code, the C++ server claims it takes about 200 ms to finish its task, but the Python client claims it needs 2 s to finish one iteration. I do not know where the overhead comes from or how I could make the client as fast as possible. Do you have any suggestions?
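A back-of-the-envelope calculation (my illustration, not from the thread; `batch_reply_bytes` is a hypothetical helper) shows how large each reply is, which goes a long way toward explaining where the time goes:

```python
import numpy as np

# Size of one BatchReply payload, using the shapes from the server code
# above: 1024 x 224 x 224 x 3 float32 image data plus 1024 int64 labels.
def batch_reply_bytes(n=1024, h=224, w=224, c=3):
    data = n * h * w * c * np.dtype("float32").itemsize  # the `bytes data` field
    labels = n * np.dtype("int64").itemsize              # upper bound; int64 is varint-encoded on the wire
    return data + labels

print(f"{batch_reply_bytes() / 2**20:.0f} MiB per RPC")  # 588 MiB
```

At roughly 588 MiB per response, a 2 s round trip corresponds to about 300 MB/s of effective throughput, which is plausible once the payload is copied several times between protobuf, the transport, and Python.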

@CoinCheung
Author

By the way, I also noticed that the Python side gets slower and slower as the loop goes on. It first took 2 s per iteration; after several tens of iterations it took about 6 s, and after more iterations the latency could be as much as 30 s.

By observing the output of htop, I saw that one CPU core works at 100% during the transmission. Is this a configuration problem on my side, or a bug in grpc-python?

@veblush veblush assigned gnossen and unassigned veblush Mar 12, 2020
@gnossen
Contributor

gnossen commented Mar 12, 2020

@CoinCheung My first observation looking at your client code is that you are spinning up and tearing down a channel on every loop iteration. While this use case is supported, it is not going to be performant.

The increasing latency with each iteration sounds weird and, if real, is probably a legitimate bug. 100% CPU also isn't normal. Can you please rework things so that your client process uses a single shared channel, and see how that affects the metrics you're gathering?

@CoinCheung
Author

Actually, I have already tried that; the code is like this:

def run_get_batch(stub):
    req = BatchRequest()
    print('before call')
    reply = stub.get_batch(req)
    print('after call')

    ims = np.frombuffer(
            reply.data, dtype=reply.dtype).reshape(reply.shape)
    lbs = np.array(reply.labels)
    print(ims.shape)


if __name__ == "__main__":
  
    with grpc.insecure_channel(
            'localhost:50001',
            options=[
                ('grpc.max_receive_message_length', 1 << 30),
                ('grpc.max_send_message_length', 1 << 30),
            ]
        ) as channel:
        stub = ImageServiceStub(channel)
        for _ in range(1300):
            t1 = time.time()
            run_get_batch(stub)
            t2 = time.time()
            print('time is: {:.4f}'.format(t2 - t1))

I actually observed no difference. The speed would get slower and slower as reported above.

@stale

stale bot commented May 6, 2020

This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc.) for 30 days. It will be closed automatically if no further update occurs in 7 days. Thank you for your contributions!

@CoinCheung
Author

??? Can an issue be closed just because nobody has noticed it for a long time?

@stale stale bot removed the disposition/stale label May 6, 2020
@SimSmith

I've noticed something very similar, except my server was written in Go. The Python client always takes more than 2 seconds to execute its first RPC on a connection/channel. Subsequent calls finish in about a millisecond each. If the server is written in Python instead, the Python client takes about 10 ms for its first RPC.

This can also be observed with the helloworld examples, e.g. with the Go helloworld server and the Python helloworld client, plus a little extra timing code:

from __future__ import print_function
import logging
import time  #### added this
import grpc

import helloworld_pb2
import helloworld_pb2_grpc


def run():
    with grpc.insecure_channel('localhost:50051') as channel:
        stub = helloworld_pb2_grpc.GreeterStub(channel)
        start = time.time()  #### added this
        response = stub.SayHello(helloworld_pb2.HelloRequest(name='you'))
        end = time.time() - start  #### added this
        print("Time in seconds:", end)  #### added this
    print("Greeter client received: " + response.message)


if __name__ == '__main__':
    logging.basicConfig()
    run()

Run these together and greeter_client.py will print Time in seconds: 2.016329765319824. I haven't tried C++'s helloworld, but it sounds like the same problem.
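This measurement folds any lazy channel/connection setup into the first RPC. A stdlib-only harness (a sketch; `first_vs_warm_latency` is a made-up name) can time the first call on a fresh channel separately from warm calls; with gRPC one could additionally call `grpc.channel_ready_future(channel).result()` before timing to exclude connection establishment entirely:

```python
import time

def first_vs_warm_latency(call, warm_calls=10):
    """Time the first invocation (which includes any lazy setup)
    separately from subsequent warm invocations."""
    t0 = time.perf_counter()
    call()                                   # first call: pays connection setup
    first = time.perf_counter() - t0
    warm = []
    for _ in range(warm_calls):
        t0 = time.perf_counter()
        call()                               # warm calls: steady-state latency
        warm.append(time.perf_counter() - t0)
    return first, sum(warm) / len(warm)

# e.g. first, warm = first_vs_warm_latency(
#     lambda: stub.SayHello(helloworld_pb2.HelloRequest(name="you")))
```

If the first number is ~2 s while the warm average is ~1 ms, the slowness is in connection establishment rather than in the RPC path itself.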

@CoinCheung
Author

Has the community abandoned the Python API?

@SimSmith

They are at least still updating the Python library.
@CoinCheung, have you tried the example I gave above?
I'm curious whether you can reproduce it with either C++ or Go as the helloworld server.

@gnossen
Contributor

gnossen commented May 21, 2020

@CoinCheung Sorry for the slow response on this thread. I haven't had many free cycles in the recent past. I'll take a look in the next couple of days.

@gnossen
Contributor

gnossen commented May 26, 2020

@CoinCheung I pulled down your example and ran it, excising the numpy dependency to simplify the build. I cannot reproduce an increase in end-to-end client latency over time; it looks pretty stable to me. [chart: client latency over time]

I tried running a profile of your setup and found that the majority of the time is spent in libc performing a copy, while an additional 23% of the time is spent in the kernel, presumably driving the TCP stack (I don't have kernel debug symbols installed on this box):

  Overhead  Command          Shared Object        Symbol                                                                                                                                               
+   34.98%  python3          libc-2.30.so         [.] __memmove_avx_unaligned_erms                                                                                                                     
+    9.46%  python3          [kernel.kallsyms]    [k] 0xffffffffbc17405e                                                                                                                               
+    8.41%  python3          [kernel.kallsyms]    [k] 0xffffffffbc173b67                                                                                                                               
+    6.12%  python3          [kernel.kallsyms]    [k] 0xffffffffbc173ffe    

This is presumably due to the very large protos you are pushing between the client and the server.
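A ~600 MB payload copied a few times (the protobuf parse, the Python bytes object, any user-side copy) would account for the memmove time. For what it's worth (my illustration, using a small stand-in buffer rather than the real reply), the client can at least avoid adding yet another copy: `np.frombuffer` returns a zero-copy view over the bytes of `reply.data`, whereas materializing a fresh array copies the whole payload again:

```python
import numpy as np

buf = b"\x01" * (16 * 2**20)  # 16 MiB stand-in for the reply.data bytes field

view = np.frombuffer(buf, dtype=np.uint8)  # zero-copy view over the bytes object
copy = np.array(view)                      # materializes a second copy of the payload

print(view.flags.owndata, copy.flags.owndata)  # False True
```

So the client code in this thread (which already uses `np.frombuffer`) is not adding an avoidable copy; the remaining memmove cost is inside the transport and protobuf layers.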

@SimSmith

@CoinCheung are you running the client on Windows?

@gnossen
Contributor

gnossen commented May 27, 2020

@SimSmith No. I ran this on Linux 5.2.17. I didn't see any mention of Windows in this thread.

@CoinCheung Can you confirm what platform you're running on?

@CoinCheung
Author

@gnossen I am running it on Linux, and the output of lsb_release is:

LSB Version:    :core-4.1-amd64:core-4.1-noarch

Also, I am not running on the host directly; I am running the program in an ubuntu16.04 Docker container on that machine.

@CoinCheung
Author

My environment has been pushed to Docker Hub; please see this issue for my environment.

@gnossen
Contributor

gnossen commented May 29, 2020

@CoinCheung Thank you for the repro container! I'm still unable to reproduce, though. The RPC latencies I'm recording don't look abnormal at all:
[chart: RPC latency in seconds]

I'm not including the time to spin up a channel in these measurements, since the code in the example spins up and tears down a channel for each individual RPC. Instead, you should be using a single channel per worker subprocess. Take a look at the multiprocessing example for the particulars.

@ljch2018

@CoinCheung I met the same problem. Have you found any clues? Thanks.

@CoinCheung
Author

@ljch2018 Not exactly. The developer of this repo suggested that maybe I should update my operating system kernel, but I have not tried that yet.

@ljch2018

@CoinCheung I have tried deploying my code on multiple OS kernel versions, but nothing changed.

@gnossen
Contributor

gnossen commented Jul 17, 2020

@ljch2018 If you'd like to file a new issue with a reproduction case, we'll be happy to take a look. The problem in this case was that I could not reproduce the latency issues on any of my machines.

@leannc

leannc commented Nov 16, 2020

I met the same problem. My server is in C++ and my client is JS in Electron, with a very small .proto. I added logging on both the client and the server: when I called the service from the client, it took 1-2 seconds for the server to receive the call; the server returned as soon as it was invoked, but it took 2-5 seconds for the client to receive the reply!

@leannc

leannc commented Nov 17, 2020

I found out that when I refreshed my Electron app after it launched, the gRPC speed became normal! Then I turned off the renderer process debugger (I had attached the debugger in launch.json immediately after the main process), and everything worked. I built it and tested it; the speed is OK. I don't know why the renderer process debugger could affect this, but I hope this is helpful to anyone with the same problem.

By the way, I logged the waitForClientReady function: it returns the error "Failed to connect before the deadline" even when I have connected successfully, but it does not call the callback at all when the connection fails, i.e. the first time the app starts with the renderer process debugger attached.

@preranam05

Could you tell me how to run server.py, client.py and the protobuf, please?

@wangqianscu

Hello, I tried your code but the Python client could not connect to the server. Is there any environment setting needed? Thanks a lot!
