Latency problem with 1k req/sec #50

Open
zavada opened this issue Oct 9, 2014 · 8 comments

zavada commented Oct 9, 2014

I'm using Redis server 2.8.17.
I send requests through a Unix socket via nginx+lua; each request is a single SMEMBERS command. Right now I handle about 1k requests per second. In the nginx access log I mostly see the following:

[09/Oct/2014:18:04:49 +0300] [ 1412867089.670 : 0.002 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.670 : 0.002 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.670 : 0.002 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.670 : 0.002 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.671 : 0.001 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.672 : 0.000 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.673 : 0.001 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.673 : 0.001 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.674 : 0.000 : 198 ] 212.1.66.241 - - "POST     /API/0.1/url/getCategory/ HTTP/1.1" 200 37 
[09/Oct/2014:18:04:49 +0300] [ 1412867089.681 : 0.000 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 

As you can see, the request time is usually no more than 2 ms. But almost every second a few requests take about 100 ms:

[09/Oct/2014:18:05:22 +0300] [ 1412867122.843 : 0.131 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:05:23 +0300] [ 1412867123.212 : 0.100 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29 
[09/Oct/2014:18:05:23 +0300] [ 1412867123.613 : 0.101 : 190 ] 212.1.66.241 - - "POST /API/0.1/url/getCategory/ HTTP/1.1" 200 29

I can't find where the problem is.
I have checked everything from the latency guide:
- the slowlog has no entries
- the system is not swapping
- I don't use AOF

I have read a lot of Stack Overflow posts like this one:
http://stackoverflow.com/questions/16841469/redis-performance-tuning/23719773#23719773
but I still can't find a solution.

redis-benchmark gives me the following:

redis-benchmark -s /tmp/redis.sock -q -n 10000 -c 1 -d 50
PING_INLINE: 51282.05 requests per second
PING_BULK: 64935.07 requests per second
SET: 39840.64 requests per second
GET: 28901.73 requests per second
INCR: 26525.20 requests per second
LPUSH: 29411.76 requests per second
LPOP: 23255.81 requests per second
SADD: 21551.72 requests per second
SPOP: 23041.47 requests per second
LPUSH (needed to benchmark LRANGE): 45045.04 requests per second
LRANGE_100 (first 100 elements): 10101.01 requests per second
LRANGE_300 (first 300 elements): 4196.39 requests per second
LRANGE_500 (first 450 elements): 3337.78 requests per second
LRANGE_600 (first 600 elements): 7320.64 requests per second
MSET (10 keys): 62111.80 requests per second

agentzh commented Oct 9, 2014

@zavada What nginx variable are you using for the "request time" field in your access log? Is it $request_time? If yes, then the time specified by the $request_time variable includes the network I/O time between the client and your web server, which can be relatively large for slow downstream networks.

zavada commented Oct 10, 2014

@agentzh Yes, I use $request_time.
But if I comment out the line

local categories, err = red:smembers("categories:"..clear_url)

then these slow (>100 ms) requests no longer show up. That's why I think the problem is not in the network I/O time between the client and my web server.

Do you have any other ideas?

agentzh commented Oct 10, 2014

@zavada Then you'll have to use tools like systemtap and dtrace to grab more details from those slow request samples in various places: nginx processes, kernel network stack, and the redis-server process. See my sample tools based on systemtap in the nginx-systemtap-toolkit and stapxx projects on GitHub for some ideas.

agentzh commented Oct 10, 2014

@zavada Also, $request_time may have some error due to the time caching mechanism in the nginx core. Use the ngx-lua-tcp-recv-time tool in my stapxx project to measure the upstream cosocket recv latency accurately in real time:

https://github.com/openresty/stapxx#ngx-lua-tcp-recv-time
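For reference, invoking it against one nginx worker looks roughly like this (check the stapxx README for the exact options; the pid is just a placeholder):

export PATH=$PWD:$PATH                 # make the stap++ wrapper visible to the .sxx scripts
ps aux | grep 'nginx: worker'          # pick the pid of a busy worker, e.g. 12345
./samples/ngx-lua-tcp-recv-time.sxx -x 12345   # prints a latency distribution for the cosocket receives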

agentzh commented Oct 10, 2014

@zavada Furthermore, watch the CPU usage of your nginx worker processes. If their CPU usage hits 100% from time to time, then you may simply be running out of CPU time, and such intermittent long latency is entirely expected. Similarly, you should watch the CPU usage of your redis-server process as well.
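For example, per-process CPU usage can be watched with something like this (pidstat comes from the sysstat package; the process name patterns are only examples):

# print per-second CPU usage for the nginx workers and the redis-server process
pidstat -u -p "$(pgrep -d, -f 'nginx: worker|redis-server')" 1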

If any process is exhausting the CPU time, then you should use the various flame graph sampling tools to analyze the bottleneck:

https://github.com/openresty/nginx-systemtap-toolkit#sample-bt

https://github.com/openresty/stapxx#lj-lua-stacks
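A rough sketch of producing an on-CPU flame graph with sample-bt (flags as in the nginx-systemtap-toolkit README; the pid is a placeholder and Brendan Gregg's FlameGraph scripts are assumed to be on PATH):

./sample-bt -p 12345 -t 10 -u > a.bt      # sample user-space backtraces of one worker for 10 seconds
stackcollapse-stap.pl a.bt > a.cbt
flamegraph.pl a.cbt > a.svg               # open a.svg in a browser to find the hot code paths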

Regards,
-agentzh

agentzh commented Oct 10, 2014

@zavada Sometimes nginx may simply block on blocking I/O system calls (like file I/O) or on semaphores used as internal locks in nginx. Such blocking can contribute to your cosocket latency as well. You can use the epoll-loop-blocking-distr tool to verify this:

https://github.com/openresty/stapxx#epoll-loop-blocking-distr

and further use the off-CPU time flame graph tool to analyze the causes:

https://github.com/openresty/nginx-systemtap-toolkit#sample-bt-off-cpu
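Rough invocation sketches for the two tools (check the respective READMEs for the exact options; pids are placeholders):

./samples/epoll-loop-blocking-distr.sxx -x 12345 --arg time=60   # latency distribution of blocking inside the event loop
./sample-bt-off-cpu -p 12345 -t 10 -u > a.bt                     # off-CPU backtraces: where the worker sleeps instead of running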

zavada commented Oct 12, 2014

@agentzh When I try to run your examples I get the following error:
Checking "/lib/modules/2.6.32-431.el6.x86_64/build/.config" failed with error: No such file or directory
On my CentOS server the directory "/lib/modules/2.6.32-431.el6.x86_64/build/" exists, but there is no ".config" file inside it. Do you know how I can fix this?

Thank you for your time and help.

agentzh commented Oct 12, 2014

Hello!

Have you installed the "kernel-devel" and "kernel-debuginfo" packages for the kernel you are actually running?

See the following documentation for more details:

https://www.sourceware.org/systemtap/SystemTap_Beginners_Guide/using-systemtap.html#install-kinfo
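On CentOS 6 that usually boils down to something like the following (the package versions must match the running kernel exactly; debuginfo-install comes from yum-utils):

# install the kernel headers/config and debug symbols matching the running kernel
yum install -y yum-utils kernel-devel-$(uname -r)
debuginfo-install -y kernel-$(uname -r)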

Regards,
-agentzh
