High latency while doing concurrent unary rpc over same connection(grpc-go) #1512

Closed
janardhan1993 opened this Issue Sep 12, 2017 · 9 comments

janardhan1993 commented Sep 12, 2017

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.6

What version of Go are you using (go version)?

1.9

What operating system (Linux, Windows, …) and version?

Ubuntu 16.04

What did you do?

If possible, provide a recipe for reproducing the error.
I wrote a test to check gRPC latency for unary RPCs on my machine.
Link to the code: https://play.golang.org/p/YgS_xdzN_t

What did you expect to see?

I expected to see lower latency for requests. We care more about latency than throughput for some use cases. Are there any options we can specify on the client or server side to avoid these latency issues?

What did you see instead?

I noticed that the latency was very high, around 30-40 ms, when I make 2000 concurrent unary RPCs with a payload of 1 KB.
But if I did only one RPC at a time, the latency was only about 100 µs. The latency dropped to around 1 ms when I did only 200 concurrent requests at a time or increased the payload size to 10 KB.

I noticed that data is not flushed when there are concurrent writers in http_client.go, which might be adding to the latency.

@janardhan1993 janardhan1993 changed the title High latency while concurrent unary rpc over same connection(grpc-go) High latency while doing concurrent unary rpc over same connection(grpc-go) Sep 12, 2017

MakMukhi commented Sep 12, 2017

From my understanding, the increased latency is due to the sheer load that 2000 concurrent streams put on the CPU and memory. I ran our benchmark with 2000 concurrent streams, 1 KB requests and responses, and unary RPCs, and found CPU utilization at 450%.
Note that batching is actually good when there are multiple streams, since it saves expensive syscalls.
I also ran the same benchmark on this PR; it improves performance notably. Perhaps run your benchmarks on it and see if it helps. We plan to merge it in soon.

A small remark on your benchmark code:
var dur is accessed unsafely by multiple goroutines. You could perhaps change it to an int64 and update it atomically.
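
Something along these lines (a minimal sketch; `durNanos` and the loop shape are placeholders, not your actual benchmark code):

```go
package main

import (
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var durNanos int64 // total latency across all goroutines, in nanoseconds
	var wg sync.WaitGroup
	for i := 0; i < 2000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			start := time.Now()
			// ... issue the unary RPC here ...
			atomic.AddInt64(&durNanos, int64(time.Since(start)))
		}()
	}
	wg.Wait()
	total := time.Duration(atomic.LoadInt64(&durNanos))
	_ = total // e.g. divide by the call count for the average
}
```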

Also, I recommend looking at our benchmark code

janardhan1993 commented Sep 12, 2017

@MakMukhi I ran the benchmark code on the loopy_writer branch and there wasn't much difference compared to master.

As you can see, with 1000 concurrent calls, a request and response size of 1 KB, and no round-trip latency, the average time for gRPC is 44 ms, which is a lot.

We use gRPC in our database, where latency is critical for reads, and 1000 concurrent requests is quite normal in production. We use gRPC for communication between nodes in the database (when running in cluster mode). We usually have a single connection between nodes and do all RPCs over it. What do you suggest we do to achieve good latency (e.g. opening more connections, setting some client options, etc.)?

/usr/bin/time -v go run main.go -benchtime=10s -workloads=all  -maxConcurrentCalls=1000 -traceMode=false -reqSizeBytes=1000 -respSizeBytes=1000 -latency=0s -kbps=0 -mtu=0
Unary-noTrace-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1000-reqSize_1000B-respSize_1000B-Compressor_false:
  226017	     44287 ns/op    26879 B/op	     190 allocs/op
Histogram (unit: ms)
Count: 226017  Min:   7.7  Max: 175.9  Avg: 44.25
------------------------------------------------------------
[        7.740154,         7.740155)       1    0.0%    0.0%  
[        7.740155,         7.740162)       0    0.0%    0.0%  
[        7.740162,         7.740221)       0    0.0%    0.0%  
[        7.740221,         7.740706)       0    0.0%    0.0%  
[        7.740706,         7.744682)       0    0.0%    0.0%  
[        7.744682,         7.777295)       0    0.0%    0.0%  
[        7.777295,         8.044820)       0    0.0%    0.0%  
[        8.044820,        10.239320)      24    0.0%    0.0%  
[       10.239320,        28.240719)     629    0.3%    0.3%  
[       28.240719,       175.905518)  225362   99.7%  100.0%  ##########
[      175.905518,              inf)       1    0.0%  100.0%  

Stream-noTrace-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1000-reqSize_1000B-respSize_1000B-Compressor_false
  491891	     20349 ns/op    17133 B/op	      37 allocs/op
Histogram (unit: µs)
Count: 491891  Min:  77.4  Max: 77139.2  Avg: 20329.41
------------------------------------------------------------
[      77.353000,       77.354000)       1    0.0%    0.0%  
[      77.354000,       77.360522)       0    0.0%    0.0%  
[      77.360522,       77.409576)       0    0.0%    0.0%  
[      77.409576,       77.778546)       0    0.0%    0.0%  
[      77.778546,       80.553826)       0    0.0%    0.0%  
[      80.553826,      101.428625)       0    0.0%    0.0%  
[     101.428625,      258.442420)       7    0.0%    0.0%  
[     258.442420,     1439.451739)      13    0.0%    0.0%  
[    1439.451739,    10322.639411)    4774    1.0%    1.0%  
[   10322.639411,    77139.236000)  487095   99.0%  100.0%  ##########
[   77139.236000,             inf)       1    0.0%  100.0%  

	Command being timed: "go run main.go -benchtime=10s -workloads=all -maxConcurrentCalls=1000 -traceMode=false -reqSizeBytes=1000 -respSizeBytes=1000 -latency=0s -kbps=0 -mtu=0"
	User time (seconds): 74.90
	System time (seconds): 4.77
	Percent of CPU this job got: 372%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:21.37
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 123668
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 104139
	Voluntary context switches: 155742
	Involuntary context switches: 8481
	Swaps: 0
	File system inputs: 0
	File system outputs: 19088
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

MakMukhi commented Sep 13, 2017

On my machine, the master branch shows latency (for 2000 concurrent streams) of 107 ms:

  go run benchmark/benchmain/main.go -benchtime=10s -workloads=unary   -compression=off -maxConcurrentCalls=2000 -traceMode=false   -reqSizeBytes=1024 -respSizeBytes=1024   -latency=0s -kbps=0 -mtu=0   -cpuProfile=cpuProf -memProfile=memProf -memProfileRate=10000
Unary-noTrace-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_2000-reqSize_1024B-respSize_1024B-Compressor_false:
  187049	     53730 ns/op    23579 B/op	     143 allocs/op
Histogram (unit: µs)
Count: 187049  Min: 656.6  Max: 265936.6  Avg: 107151.64
------------------------------------------------------------
[      656.573000,       656.574000)       1    0.0%    0.0%  
[      656.574000,       656.581629)       0    0.0%    0.0%  
[      656.581629,       656.647462)       0    0.0%    0.0%  
[      656.647462,       657.215542)       0    0.0%    0.0%  
[      657.215542,       662.117581)       0    0.0%    0.0%  
[      662.117581,       704.417928)       2    0.0%    0.0%  
[      704.417928,      1069.433260)       0    0.0%    0.0%  
[     1069.433260,      4219.199233)       0    0.0%    0.0%  
[     4219.199233,     31398.949825)      61    0.0%    0.0%  
[    31398.949825,    265936.642000)  186984  100.0%  100.0%  ##########
[   265936.642000,              inf)       1    0.0%  100.0%  

while the loopy_writer branch shows a latency of 64 ms for the same:

 go run benchmark/benchmain/main.go -benchtime=10s -workloads=unary   -compression=off -maxConcurrentCalls=2000 -traceMode=false   -reqSizeBytes=1024 -respSizeBytes=1024   -latency=0s -kbps=0 -mtu=0   -cpuProfile=cpuProf -memProfile=memProf -memProfileRate=10000
Unary-noTrace-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_2000-reqSize_1024B-respSize_1024B-Compressor_false:
  308610	     32460 ns/op    27690 B/op	     190 allocs/op
Histogram (unit: ms)
Count: 308610  Min:   3.7  Max: 208.1  Avg: 64.81
------------------------------------------------------------
[        3.675335,         3.675336)       1    0.0%    0.0%  
[        3.675336,         3.675343)       0    0.0%    0.0%  
[        3.675343,         3.675405)       0    0.0%    0.0%  
[        3.675405,         3.675924)       0    0.0%    0.0%  
[        3.675924,         3.680273)       0    0.0%    0.0%  
[        3.680273,         3.716730)       0    0.0%    0.0%  
[        3.716730,         4.022339)       0    0.0%    0.0%  
[        4.022339,         6.584198)       7    0.0%    0.0%  
[        6.584198,        28.059752)     601    0.2%    0.2%  
[       28.059752,       208.085003)  308000   99.8%  100.0%  ##########
[      208.085003,              inf)       1    0.0%  100.0%  

But I understand that you're looking for better latency values.
Since you don't care as much about throughput, how about limiting the number of concurrent streams to 100? The latency in that scenario is about 5 ms:

go run benchmark/benchmain/main.go -benchtime=10s -workloads=unary   -compression=off -maxConcurrentCalls=100 -traceMode=false   -reqSizeBytes=1024 -respSizeBytes=1024   -latency=0s -kbps=0 -mtu=0   -cpuProfile=cpuProf -memProfile=memProf -memProfileRate=10000
Unary-noTrace-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1024B-respSize_1024B-Compressor_false:
  186687	     53590 ns/op    23374 B/op	     145 allocs/op
Histogram (unit: µs)
Count: 186687  Min: 672.2  Max: 12310.5  Avg: 5357.82
------------------------------------------------------------
[     672.227000,      672.228000)       1    0.0%    0.0%  
[     672.228000,      672.233097)       0    0.0%    0.0%  
[     672.233097,      672.264170)       0    0.0%    0.0%  
[     672.264170,      672.453619)       0    0.0%    0.0%  
[     672.453619,      673.608640)       0    0.0%    0.0%  
[     673.608640,      680.650517)       2    0.0%    0.0%  
[     680.650517,      723.583115)       3    0.0%    0.0%  
[     723.583115,      985.332614)      12    0.0%    0.0%  
[     985.332614,     2581.155000)    1286    0.7%    0.7%  
[    2581.155000,    12310.492000)  185382   99.3%  100.0%  ##########
[   12310.492000,             inf)       1    0.0%  100.0%  

You can do this by setting the grpc.MaxConcurrentStreams server option.
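
For example, a minimal sketch (service registration elided; the address is a placeholder):

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	// Cap the number of streams the server allows per HTTP/2 connection;
	// additional client streams queue until a slot frees up.
	s := grpc.NewServer(grpc.MaxConcurrentStreams(100))
	// ... register your services on s here ...
	log.Fatal(s.Serve(lis))
}
```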

manishrjain commented Sep 13, 2017

So, to achieve low latency, what do you recommend? Should we create multiple connections and round-robin over them for every request? 5 ms just to do a heartbeat (e.g. in Raft) is still too much; we should be seeing microsecond durations in these cases where both servers are in the same datacenter.
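
For concreteness, this hypothetical sketch is roughly what we have in mind (the connpool package would be ours to write; it is not a grpc-go API):

```go
package connpool

import (
	"sync/atomic"

	"google.golang.org/grpc"
)

// Pool round-robins RPCs over several client connections to one target.
type Pool struct {
	conns []*grpc.ClientConn
	next  uint64
}

// New dials n connections to target with the given options.
func New(target string, n int, opts ...grpc.DialOption) (*Pool, error) {
	p := &Pool{}
	for i := 0; i < n; i++ {
		cc, err := grpc.Dial(target, opts...)
		if err != nil {
			return nil, err
		}
		p.conns = append(p.conns, cc)
	}
	return p, nil
}

// Pick returns the next connection in round-robin order; pass it to a
// generated client stub for each call.
func (p *Pool) Pick() *grpc.ClientConn {
	i := atomic.AddUint64(&p.next, 1)
	return p.conns[i%uint64(len(p.conns))]
}
```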

MakMukhi commented Sep 13, 2017

Why is more than one concurrent RPC needed if it's only a heartbeat?
For a single call on a connection, the latency is actually about 193 µs:

 go run benchmark/benchmain/main.go -benchtime=10s -workloads=unary   -compression=off -maxConcurrentCalls=1 -traceMode=false   -reqSizeBytes=1024 -respSizeBytes=1024   -latency=0s -kbps=0 -mtu=0   -cpuProfile=cpuProf -memProfile=memProf -memProfileRate=10000
Unary-noTrace-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1-reqSize_1024B-respSize_1024B-Compressor_false:
   51275	    195028 ns/op    23465 B/op	     214 allocs/op
Histogram (unit: µs)
Count: 51275  Min: 131.3  Max: 2684.4  Avg: 194.82
------------------------------------------------------------
[    131.280000,     131.281000)      1    0.0%    0.0%  
[    131.281000,     131.285151)      0    0.0%    0.0%  
[    131.285151,     131.306533)      0    0.0%    0.0%  
[    131.306533,     131.416676)      0    0.0%    0.0%  
[    131.416676,     131.984026)      0    0.0%    0.0%  
[    131.984026,     134.906488)      5    0.0%    0.0%  
[    134.906488,     149.960282)    666    1.3%    1.3%  
[    149.960282,     227.503392)  45713   89.2%   90.5%  #########
[    227.503392,     626.933186)   4880    9.5%  100.0%  #
[    626.933186,    2684.423000)      9    0.0%  100.0%  
[   2684.423000,            inf)      1    0.0%  100.0%  


We do realize that the critical path has a lot of contention, and all the synchronization code makes latency worse as the number of concurrent calls on a connection increases. We are working on reducing this, so these numbers should get better in the near future.

About having multiple connections vs one: what's the use case that you are optimizing for? Is it a heartbeat mechanism? A service that sends a query to a server and gets a response back? A mixture of both?

In case it's 1: I would go with one connection and limit the number of concurrent heartbeats to 1. However, I do recommend looking into our keepalive feature, which essentially makes sure that a connection is active (i.e. the other side is alive); see the sketch below.
In case it's 2: If it's the same backend server, I wouldn't really suggest opening another connection to it; rather, limit the number of concurrent calls.
In case it's 3 (the client queries the server and also runs a heartbeat mechanism): I would recommend the same as case 2, with keepalive turned on.
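
For reference, enabling client-side keepalive looks roughly like this minimal sketch (the durations and address are placeholders, not recommended values):

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	cc, err := grpc.Dial("server.example.com:50051",
		grpc.WithInsecure(), // plaintext for the sketch; use real credentials in production
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping when the connection is idle this long
			Timeout:             10 * time.Second, // close the connection if the ping isn't acked in time
			PermitWithoutStream: true,             // ping even when there are no active RPCs
		}))
	if err != nil {
		log.Fatal(err)
	}
	defer cc.Close()
}
```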

janardhan1993 commented Sep 13, 2017

Strangely, I get more throughput if I set maxConcurrentCalls to 100 rather than 1000 on my machine. In 10 seconds the count is 277203 with 100 concurrent calls and 215057 with 1000 concurrent calls.

go run benchmark/benchmain/main.go -benchtime=10s -workloads=unary   -compression=off -maxConcurrentCalls=1000 -traceMode=false   -reqSizeBytes=1024 -respSizeBytes=1024   -latency=0s -kbps=0 -mtu=0
Unary-noTrace-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1000-reqSize_1024B-respSize_1024B-Compressor_false:
  215057	     46572 ns/op    28423 B/op	     190 allocs/op
Histogram (unit: ms)
Count: 215057  Min:  11.9  Max: 128.1  Avg: 46.52
------------------------------------------------------------
[       11.890526,        11.890527)       1    0.0%    0.0%  
[       11.890527,        11.890534)       0    0.0%    0.0%  
[       11.890534,        11.890588)       0    0.0%    0.0%  
[       11.890588,        11.891014)       0    0.0%    0.0%  
[       11.891014,        11.894368)       0    0.0%    0.0%  
[       11.894368,        11.920773)       3    0.0%    0.0%  
[       11.920773,        12.128659)       0    0.0%    0.0%  
[       12.128659,        13.765331)       5    0.0%    0.0%  
[       13.765331,        26.650745)     574    0.3%    0.3%  
[       26.650745,       128.096754)  214473   99.7%  100.0%  ##########
[      128.096754,              inf)       1    0.0%  100.0%  

➜  grpc git:(loopy_writer) go run benchmark/benchmain/main.go -benchtime=10s -workloads=unary   -compression=off -maxConcurrentCalls=100 -traceMode=false   -reqSizeBytes=1024 -respSizeBytes=1024   -latency=0s -kbps=0 -mtu=0 
Unary-noTrace-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_100-reqSize_1024B-respSize_1024B-Compressor_false:
  277203	     36081 ns/op    28211 B/op	     189 allocs/op
Histogram (unit: µs)
Count: 277203  Min: 172.5  Max: 24015.4  Avg: 3606.44
------------------------------------------------------------
[     172.491000,      172.492000)       1    0.0%    0.0%  
[     172.492000,      172.497602)       0    0.0%    0.0%  
[     172.497602,      172.534593)       0    0.0%    0.0%  
[     172.534593,      172.778819)       0    0.0%    0.0%  
[     172.778819,      174.391317)       0    0.0%    0.0%  
[     174.391317,      185.037789)       0    0.0%    0.0%  
[     185.037789,      255.330810)       1    0.0%    0.0%  
[     255.330810,      719.438459)     102    0.0%    0.0%  
[     719.438459,     3783.695846)  172890   62.4%   62.4%  ######
[    3783.695846,    24015.367000)  104208   37.6%  100.0%  ####
[   24015.367000,             inf)       1    0.0%  100.0%  

I tried setting MaxConcurrentStreams to 100 in the server options in the benchmark and ran it with 1000 concurrent requests, but the latency was still high.

go run benchmark/benchmain/main.go -benchtime=10s -workloads=unary   -compression=off -maxConcurrentCalls=1000 -traceMode=false   -reqSizeBytes=1024 -respSizeBytes=1024   -latency=0s -kbps=0 -mtu=0
Unary-noTrace-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1000-reqSize_1024B-respSize_1024B-Compressor_false:
  184611	     54420 ns/op    28246 B/op	     189 allocs/op
Histogram (unit: µs)
Count: 184611  Min: 222.3  Max: 89047.7  Avg: 54298.48
------------------------------------------------------------
[     222.280000,      222.281000)       1    0.0%    0.0%  
[     222.281000,      222.287641)       0    0.0%    0.0%  
[     222.287641,      222.338390)       0    0.0%    0.0%  
[     222.338390,      222.726182)       0    0.0%    0.0%  
[     222.726182,      225.689441)       0    0.0%    0.0%  
[     225.689441,      248.332771)       0    0.0%    0.0%  
[     248.332771,      421.358659)       0    0.0%    0.0%  
[     421.358659,     1743.512134)       0    0.0%    0.0%  
[    1743.512134,    11846.565689)       9    0.0%    0.0%  
[   11846.565689,    89047.657000)  184600  100.0%  100.0%  ##########
[   89047.657000,             inf)       1    0.0%  100.0%  

manishrjain commented Sep 13, 2017

It's not just heartbeats. We have one connection per pair of servers, so it's a mix of everything: heartbeats, small requests, big requests, and queries to the other server, with payload sizes ranging from small to big. All of the communication flows through that single connection at the moment.

One of the issues we've noticed is that for small requests the latency is unexpectedly high, presumably because of the client-side batching that's going on.

Will look into keepalive. (On a side note, if it can give us a way to test if a connection is active, we can use it to avoid reference counting to determine if a connection can be closed.)

@janardhan1993 is trying to limit the number of concurrent calls by setting max concurrent streams. It looks like latency is still an issue there as well. Let us know if we should do this at our end instead, by not allowing more than X calls on the same connection.
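
If we do it at our end, a counting semaphore around the stub calls is roughly what we have in mind (a hypothetical sketch, not a grpc-go facility):

```go
package rpclimit

import "context"

// Limiter caps the number of in-flight RPCs over one connection.
type Limiter chan struct{}

// New returns a Limiter that allows at most max concurrent calls.
func New(max int) Limiter { return make(Limiter, max) }

// Do runs fn while holding one of the limiter's slots, so at most
// cap(l) calls are outstanding at any time.
func (l Limiter) Do(ctx context.Context, fn func(context.Context) error) error {
	select {
	case l <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-l }() // release the slot
	return fn(ctx)
}
```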

MakMukhi commented Sep 13, 2017

I see. I suppose I was wrong in expecting that server-side throttling of the number of concurrent streams would show better results; it shouldn't. The client still creates all 1000 streams but sends out only 100 at a time, so the client-side contention isn't really reduced.

My opinion is that the high latency numbers are because of client-side contention, not batching. Consider a case where no batching happens: if a lot of big messages get to write before a small one, the small message's RPC will still see high latency. Furthermore, the loopy_writer branch, which improves batching, shows a performance improvement.

Creating multiple connections might help with the contention but I can't say that confidently since we don't have benchmarks for such a scenario. I can work on putting something together this week to test it out.

About keepalive: if it finds that a connection is no longer active, it closes the connection itself. At that point, all FailFast RPCs should fail.

MakMukhi commented Dec 18, 2017

Wanted to update this issue with the current state of master:

mmukhi@mmukhi:~/go_code/src/google.golang.org/grpc$ go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary   -compression=off -maxConcurrentCalls=1000 -trace=off   -reqSizeBytes=1024 -respSizeBytes=1024 -networkMode=Local -cpuProfile=master.cpu
Unary-traceMode_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_1000-reqSize_1024B-respSize_1024B-Compressor_false: 
50_Latency: 19.3949 ms 	90_Latency: 22.7358 ms 	99_Latency: 27.6315 ms 	Avg latency: 16.8282 ms 	Count: 3565323 	24009 Bytes/op	134 Allocs/op	
Histogram (unit: ms)
Count: 3565323  Min:   1.5  Max:  86.4  Avg: 16.83
------------------------------------------------------------
[       1.530619,        1.530620)        1    0.0%    0.0%  
[       1.530620,        1.530627)        0    0.0%    0.0%  
[       1.530627,        1.530677)        0    0.0%    0.0%  
[       1.530677,        1.531058)        0    0.0%    0.0%  
[       1.531058,        1.533959)        0    0.0%    0.0%  
[       1.533959,        1.556014)        0    0.0%    0.0%  
[       1.556014,        1.723683)        4    0.0%    0.0%  
[       1.723683,        2.998366)     3651    0.1%    0.1%  
[       2.998366,       12.689008)   911234   25.6%   25.7%  ###
[      12.689008,       86.361053)  2650432   74.3%  100.0%  #######
[      86.361053,             inf)        1    0.0%  100.0%  

We're working on further improving concurrency and will keep updating this issue as we move along.
