
Redis connection issues #27

Closed
urosgruber opened this issue Nov 11, 2013 · 24 comments

Comments

@urosgruber

Hi,

I'm doing some Redis lookups in my Lua script and I'm getting some weird connection problems that I don't know how to debug properly to get to the source of the problem.

Here is a snippet from the config:

-- Connect to Redis
local redis = require "resty.redis"
local red = redis:new()

red:set_timeout(1000)
local ok, err = red:connect("127.0.0.1", 6379)

if not ok then
    ngx.log(ngx.ERR, "Redis connection failure: " .. err)
    return
end

-- rest of the code ...

On a random basis I get a lot of "Redis connection failure: timeout" entries in the nginx error log. When debug logging on the Redis server, I don't see any connections at that same moment.

I'm using the latest lua-resty-redis and the 0.8.6 ngx_lua module. I also tried upgrading to 0.9.2, but then I got even more strange errors:

2013/11/10 23:24:53 [error] 52514#0: *15 attempt to send data on a closed socket: u:0000000000000000, c:0000000000000000, ft:0 eof:0, 
2013/11/10 23:24:53 [error] 52514#0: *15 attempt to send data on a closed socket: u:0000000000000000, c:0000000000000000, ft:0 eof:0, 
2013/11/10 23:24:53 [error] 52514#0: *15 attempt to send data on a closed socket: u:0000000000000000, c:0000000000000000, ft:0 eof:0, 
2013/11/10 23:24:53 [error] 52514#0: *15 attempt to send data on a closed socket: u:0000000000000000, c:0000000000000000, ft:0 eof:0, 

I don't know if this one is actually related. Is there anything I can try to get more detailed info on where things actually break? The server is FreeBSD with pf enabled, nginx 1.5.x.

Regards,

Uros

@urosgruber
Author

In case it helps, here is the tcpdump output:

01:55:56.845104 IP (tos 0x0, ttl 64, id 54246, offset 0, flags [none], proto TCP (6), length 60, bad cksum 0 (->a8d3)!)
    localhost.24595 > localhost.6379: Flags [S], cksum 0x894f (correct), seq 1246980252, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 1139938987 ecr 0], length 0
01:55:56.845216 IP (tos 0x0, ttl 64, id 42741, offset 0, flags [none], proto TCP (6), length 60, bad cksum 0 (->d5c4)!)
    localhost.6379 > localhost.24595: Flags [S.], cksum 0xaee1 (correct), seq 24767610, ack 1246980253, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 2011919485 ecr 1139938987], length 0
01:55:56.845227 IP (tos 0x0, ttl 64, id 24903, offset 0, flags [none], proto TCP (6), length 40, bad cksum 0 (->1b87)!)
    localhost.24595 > localhost.6379: Flags [R], cksum 0x85ef (correct), seq 1246980253, win 0, length 0
01:57:01.350634 IP (tos 0x0, ttl 64, id 39193, offset 0, flags [none], proto TCP (6), length 60, bad cksum 0 (->e3a0)!)
    localhost.34341 > localhost.6379: Flags [S], cksum 0xfe1c (correct), seq 1362524936, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 1384207341 ecr 0], length 0
01:57:01.350764 IP (tos 0x0, ttl 64, id 12705, offset 0, flags [none], proto TCP (6), length 60, bad cksum 0 (->4b19)!)
    localhost.6379 > localhost.34341: Flags [S.], cksum 0x4a5a (correct), seq 2189045235, ack 1362524937, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 1923387551 ecr 1384207341], length 0
01:57:01.350781 IP (tos 0x0, ttl 64, id 22800, offset 0, flags [none], proto TCP (6), length 52, bad cksum 0 (->23b2)!)
    localhost.34341 > localhost.6379: Flags [.], cksum 0xae4e (correct), seq 1, ack 1, win 1275, options [nop,nop,TS val 1384207341 ecr 1923387551], length 0
01:57:01.350868 IP (tos 0x0, ttl 64, id 64981, offset 0, flags [none], proto TCP (6), length 67, bad cksum 0 (->7edd)!)
    localhost.34341 > localhost.6379: Flags [P.], cksum 0xf8c5 (correct), seq 1:16, ack 1, win 1275, options [nop,nop,TS val 1384207341 ecr 1923387551], length 15
01:57:01.351023 IP (tos 0x0, ttl 64, id 42171, offset 0, flags [none], proto TCP (6), length 57, bad cksum 0 (->d801)!)
    localhost.6379 > localhost.34341: Flags [P.], cksum 0x2dd6 (correct), seq 1:6, ack 16, win 1275, options [nop,nop,TS val 1923387551 ecr 1384207341], length 5

The first one is the timeout, the second one is OK. As you can see, there is an R (reset) flag; I don't know why it is actually sent.

@agentzh
Member

agentzh commented Nov 11, 2013

@urosgruber Several notes:

  1. It's very possible that the connect() requests from your Nginx exceed the hard-coded backlog limit in the Redis server (which is a very small value, 511). Because it is your kernel dropping the (SYN) packets, you won't see the dropped connections in your Redis debug logs (which live in user land). You can try increasing the backlog limit in your Redis server or just scaling out your Redis server.
  2. When the SYN packet is dropped, re-sending the packet can easily exceed the timeout setting in your Lua code (that is, 1 second).
  3. When you get the "attempt to send data on a closed socket" error message, it means that the socket was already closed due to an earlier error. You should see the actual error message before these ones. Basically, you should stop using the current socket as soon as you see an error.
  4. If you already handle the socket errors yourself in your Lua code, it's recommended to turn off the lua_socket_log_errors directive: https://github.com/chaoslawful/lua-nginx-module#lua_socket_log_errors (see the sketch after the links below).
  5. Because you are on FreeBSD, you cannot directly use the diagnostic tools in my Nginx Systemtap Toolkit and stap++. You may consider porting some of the relevant tools in these two projects over to FreeBSD's dtrace:

https://github.com/agentzh/nginx-systemtap-toolkit#readme

https://github.com/agentzh/stapxx#samples
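
For points 2-4, a minimal sketch of what this looks like in Lua, assuming the same content_by_lua context as the original snippet (the 5s timeout is only an example value, not a recommendation from this thread):

-- a minimal sketch, not the original config: use a longer timeout and stop
-- using the cosocket object as soon as any operation returns an error
local redis = require "resty.redis"
local red = redis:new()

red:set_timeout(5000)  -- 5s instead of 1s, so a retransmitted SYN can still make it

local ok, err = red:connect("127.0.0.1", 6379)
if not ok then
    -- we log the error ourselves here, so lua_socket_log_errors can be turned off
    ngx.log(ngx.ERR, "Redis connection failure: ", err)
    return  -- do not call any further methods on this (closed) socket
end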

@agentzh
Member

agentzh commented Nov 11, 2013

And I should add that when connect() times out, nginx will close the current connection right away, so subsequent (late) response packets from the Redis server side will yield RST.

@urosgruber
Author

  1. The Redis server is actually doing nothing at all. This is a test server where the only client is me. Would the backlog limit still be the problem?
  2. I'll try with 10s.
  3. I set this directive to off.

Btw, is it better to use set_keepalive for the Redis client or to disable keepalive?

@agentzh
Member

agentzh commented Nov 11, 2013

Hello!

On Sun, Nov 10, 2013 at 5:36 PM, Uroš Gruber wrote:

The Redis server is actually doing nothing at all. This is a test
server where the only client is me. Would the backlog limit still
be the problem?

Yes, of course. This has actually been a common problem :) The Redis
server is a single-threaded process while Nginx can have multiple
worker processes, which means Nginx usually has way more CPU
resources and can easily exhaust the Redis server.

Btw, is it better to use set_keepalive for the Redis client or to disable keepalive?

Enabling the Redis connection pool on the client side (i.e., in Nginx)
can save the overhead of TCP handshakes during connect() and close(),
which can definitely help. But it's still possible (and easy) to
exhaust the Redis server with Redis queries BTW.

Regards,
-agentzh
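
For reference, a minimal sketch of putting the connection back into the pool once the queries are done (the 10s max idle time and pool size of 100 are arbitrary example values, not taken from this thread):

-- sketch: hand the connection back to the ngx_lua cosocket pool instead of closing it
local ok, err = red:set_keepalive(10000, 100)  -- max idle time (ms), pool size
if not ok then
    ngx.log(ngx.ERR, "failed to set keepalive: ", err)
    return
end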

@urosgruber
Author

With the timeout set to 10s things are looking better. I also checked the backlog limit on this server; it's probably a bit low: kern.ipc.somaxconn: 128

and checking netstat -Lan gives me this:

tcp4  0/0/128        192.168.168.150.80
tcp4  0/0/128        172.20.0.1.5000
tcp4  0/0/128        172.20.0.3.5000
tcp4  0/0/128        127.0.0.1.6379

Port 80 is the frontend nginx proxy and the backend servers are on port 5000.

@urosgruber
Author

Hm, and the issue is back. The server was idle; there were no open connections and no pf states. When I refreshed the webpage I received a 500 error, checked the log, and again it was the timeout problem.

@urosgruber
Author

I also tried the Unix socket configuration, connecting through /tmp/redis.sock, and it works without any issues.

Btw, increasing kern.ipc.somaxconn to 1024 does change the backlog limit Redis gets (it's now 512), but it still doesn't help. Random timeouts on an idle server still happen.

@agentzh
Member

agentzh commented Nov 11, 2013

Another possibility for timeout issues is that your Nginx worker process's event loop is heavily blocked by something. You can try porting the epoll-loop-blocking-distr tool over to kqueue and FreeBSD:

https://github.com/agentzh/stapxx#epoll-loop-blocking-distr

and the off-CPU flame graph tool over to FreeBSD's dtrace too:

https://github.com/agentzh/nginx-systemtap-toolkit#sample-bt-off-cpu

And it's wise to monitor the latency in the accept() and recv() queues in your kernel, for example, see the tools for Linux:

https://github.com/agentzh/nginx-systemtap-toolkit#tcp-accept-queue

https://github.com/agentzh/nginx-systemtap-toolkit#tcp-recv-queue

Also, you can write your own custom dtrace scripts to trace various user-land and kernel-space events and socket states associated with those timed-out connections on the fly on your side.

Also, please distinguish between timeout errors in different operations like connect(), send(), and receive(). They mean different things :)
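
A minimal sketch of tagging the phase in which an error happened, so that "timeout" entries in error.log can be told apart (the key name is just a placeholder):

-- sketch: label each phase so connect vs. read timeouts are distinguishable in the log
local ok, err = red:connect("127.0.0.1", 6379)
if not ok then
    ngx.log(ngx.ERR, "redis connect phase failed: ", err)
    return
end

local res, err = red:get("some_key")  -- placeholder key
if not res then
    ngx.log(ngx.ERR, "redis read phase failed: ", err)
    return
end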

@urosgruber
Author

For now I'll try to stick with /tmp/redis.sock because I don't have enough knowledge of DTrace. Btw, you have a lot of nice tools there :)

I also need to try the same setup with some other servers.

@xidianwlc

Hi agentzh,
I've recently been testing the Redis connection pool. After simply using local ok, err = red:set_keepalive(10000, 100) and running an ab stress test, I found that times, err = red:get_reused_times() always returns times=0, and the log keeps reporting the error: lua tcp socket connect timed out, client: 192.168.6.7. Checking the number of connections from the test server to the Redis port, I found there were a great many, so my initial suspicion is that the Lua Redis connection pool might have a problem. My test server is CentOS 6.5 and nginx is the latest OpenResty version.

@agentzh
Member

agentzh commented Jul 2, 2015

@xidianwlc Ah, please no Chinese here. This place is English only. If you really want to use Chinese, please join the openresty (Chinese) mailing list instead. Please see https://openresty.org/#Community

Regarding your question, it's already an FAQ. See my (Chinese) replies (to similar questions from others) on the aforementioned mailing list:

https://groups.google.com/d/msg/openresty/e-r69KtAWek/wJ3cdzxluhUJ

https://groups.google.com/d/msg/openresty/h3l6jAo3aD0/UvQGlF77cUwJ

@agentzh agentzh closed this as completed Jul 2, 2015
@xidianwlc

Hi agentzh,
I tried using "local ok, err = red:set_keepalive(10000, 100)" in my Lua script and "times, err = red:get_reused_times()" to check whether the connection pool was being used. But unexpectedly, times was always 0 and the error info looks like "lua tcp socket connect timed out, client: 192.168.6.7". My nginx version is ngx_openresty-1.7.10.1.
My test tool is ab and the command is ab -r -k -n 20000 -c 10000 "http://192.168.6.7:8088/redis/mget?key=bk_1006725"

@agentzh
Member

agentzh commented Jul 2, 2015

@xidianwlc When you see "connect timed out", it usually means

  1. your redis server is too busy to accept new connections and the redis server's TCP stack is dropping packets when the accept() queue is full, or
  2. your nginx server is too busy to respond to new I/O events (like doing CPU intensive computations or blocking on system calls)

You can consider

  1. scaling your Redis backend to multiple nodes so as to utilize more CPU resources,
  2. increasing the backlog setting in your redis.conf file (the default is very small and it determines the length limit of the accept() queue on your Redis server),
  3. checking whether your nginx server is too busy doing CPU-intensive work or blocking system calls (like disk I/O syscalls) by using the flame graph tools: https://openresty.org/#Profiling
  4. increasing the timeout limit in your Lua code (maybe you specified too short a timeout value on the client side?), and
  5. automatically retrying connect() one more time in your Lua code when the connect() call fails (as sketched below this list).
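
For point 5, a minimal sketch of retrying connect() a single time (host, port, and timeout are placeholders, not values from this thread):

-- sketch: retry connect() once before giving up
local redis = require "resty.redis"
local red = redis:new()
red:set_timeout(1000)

local ok, err = red:connect("127.0.0.1", 6379)
if not ok then
    ngx.log(ngx.WARN, "redis connect failed, retrying once: ", err)
    ok, err = red:connect("127.0.0.1", 6379)
end
if not ok then
    ngx.log(ngx.ERR, "redis connect failed after retry: ", err)
    return
end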

@agentzh
Member

agentzh commented Jul 2, 2015

@xidianwlc Regarding the get_reused_times always returning 0 problem, have you checked the return values of set_keepalive? You should always do proper error handling in Lua there (and elsewhere).

My hunch is that you always fail to put your connections back to the connection pool in the first place so that your pool is always empty.

@xidianwlc

@agentzh
local ok, err = red:set_keepalive(10000, 800)
if not ok then
    FORMAT_RETURN.msg = "failed to set keepalive: " .. err
    local value = cjson.encode(FORMAT_RETURN)
    ngx.say(value)
    ngx.log(ngx.ERR, value)
    return
end

There was no nginx error log entry for this.

@xidianwlc

@agentzh I found some posts at https://groups.google.com/forum/#!topic/openresty/h3l6jAo3aD0
In them you said that if the number of connections exceeds the connection pool size, short-lived connections will be used.

@agentzh
Member

agentzh commented Jul 2, 2015

@xidianwlc Isn't it already stated in the mailing list posts I mentioned earlier?

https://groups.google.com/d/msg/openresty/e-r69KtAWek/wJ3cdzxluhUJ

https://groups.google.com/d/msg/openresty/h3l6jAo3aD0/UvQGlF77cUwJ

Alas, you didn't read my comments carefully.

@agentzh
Member

agentzh commented Jul 2, 2015

@xidianwlc It's also officially documented:

https://github.com/openresty/lua-nginx-module#tcpsocksetkeepalive

"When the connection pool exceeds the available size limit, the least recently used (idle) connection already in the pool will be closed to make room for the current connection."

@RobinTsai

I also hit the error "lua tcp socket connect timed out, when connecting to xxx:xxx" (xxx:xxx is the Redis master server). Our group had been working on it for several days. I checked a lot of issues but found no resolution. In the end we used the connection pool to reduce this error, but that did not fix it.

It's on a production environment, and I'm sure the configuration is not mistaken. Across 3 different prod environments, one produces this error message frequently while the others do so only rarely. This one has only 2 cores (another one has 8 cores and uses 4 of them for nginx) but the rest of the system config is exactly the same, and so is the Redis config. Some more info:

  1. We set a 100ms connect timeout when connecting to the Redis master. When we raised it to 5s, the errors became fewer but did not disappear. (The other environments also set a 100ms connect timeout, but they don't see nearly as many errors; see the sketch after this list.)
  2. I captured the TCP traffic (only with the 100ms connect timeout setting); all the TCP connection records look normal: 3-way handshake, select db, get data, 4-way close. (Maybe because the timeout is so small there are no retransmissions. So I even wondered whether the CPU simply didn't process this event within the timeout window.)
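
A minimal sketch of giving the connect phase its own, larger timeout, assuming a lua-resty-redis/ngx_lua new enough to provide set_timeouts (the values are only examples):

-- sketch: separate connect/send/read timeouts in milliseconds, so a short read
-- timeout does not also force a short connect timeout
red:set_timeouts(1000, 500, 500)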

@zhuizhuhaomeng
Contributor

Did the timeout issue occur while you were capturing the TCP traffic?

@RobinTsai

Nope, there was no clue about the connection timeout issue in the capture.

There were no SYN retransmissions, and no packets were abnormal at the SYN stage.

@zhuizhuhaomeng
Contributor

What did you mean by "abnormal at SYN"?

@RobinTsai

If there were connect timeouts, I thought, there might:

  1. be SYN retransmit packets. >>> But there were not.
  2. be an RST in the TCP stream. >>> But there were not.
  3. Or there might be an incomplete 3-way handshake in the TCP stream. >>> But there was not.

The TCP streams I saw were all normal, as expected.
