Nginx overloaded while redis server unavailable #172

Closed
Fry-kun opened this Issue Oct 29, 2012 · 11 comments


Fry-kun commented Oct 29, 2012

While the redis server is restarting, nginx CPU usage goes to 100% on all workers. Since pretty small timeouts are set, I would've expected the nginx processes not to be affected...

Here's my config:

nginx.conf:
...
upstream redis-metrics {
    hash $arg_key;
    server gw513:6391;
    server gw514:6391;
    keepalive 1024;  # keepalive must come after hash!
}
...
location /redis_metrics_internal {
    internal;
    redis2_connect_timeout 5ms;
    redis2_send_timeout 5ms;
    redis2_read_timeout 10ms;
    redis2_pass redis-metrics;
    redis2_raw_query $echo_request_body;
}

sitemetrics.lua:
...
local qry = { 'EVAL', "local u=require('redis_utils'); return u.sitemetrics(KEYS,ARGV)", 6, ... }
ngx.location.capture("/redis_metrics_internal?key=" .. hex_ip,
    { method = ngx.HTTP_PUT, body = redis_parser_build_query(qry) })

Here are the errors that show up en masse while the redis server is being restarted. Note: these are written to a RAM drive (/dev/shm) so as not to overload the HDD with writes.

2012/10/29 14:11:22 [error] 15033#0: *7354385 upstream timed out (110: Connection timed out) while connecting to upstream, client: 123.45.53.122, server: mysites, request: "GET /somepage HTTP/1.1", subrequest: "/redis_metrics_internal", upstream: "redis2://192.168.0.15:6391", host: "mysite.net", referrer: "http://mysite.net/"

2012/10/29 14:01:35 [error] 15034#0: *4681058 lua handler aborted: runtime error: /usr/local/nginx/lua/sitemetrics.lua:38: failed to issue subrequest: -1
stack traceback:
coroutine 0:

redis_version:2.9.7 (built from git)

nginx version: nginx/1.3.6
built by gcc 4.6.3 20120306 (Red Hat 4.6.3-2) (GCC)
TLS SNI support enabled
<way too many modules :)>

Owner

agentzh commented Oct 29, 2012

Hello!

On Mon, Oct 29, 2012 at 2:22 PM, Fry-kun notifications@github.com wrote:

While restarting redis server, nginx CPU usage goes to 100% for all
workers. I would've expected that since pretty small timeouts are set,
nginx processes should not be affected...

Could you use both pstack and strace to check what the worker processes are
doing exactly, on both the userland and syscall levels?

Also, could you reproduce the issue without using the ngx_upstream_hash
module?

Is your nginx serving a lot of traffic while you restart the redis server?

Thanks!
-agentzh

Owner

agentzh commented Oct 29, 2012

Hello!

On Mon, Oct 29, 2012 at 2:22 PM, Fry-kun notifications@github.com wrote:

2012/10/29 14:01:35 [error] 15034#0: *4681058 lua handler aborted: runtime
error: /usr/local/nginx/lua/sitemetrics.lua:38: failed to issue subrequest:
-1

stack traceback:
coroutine 0:

BTW, this error message usually indicates a RAM shortage. Were you running
out of memory at that time? Because you said you were writing error logs
to /dev/shm, that may have a big impact on your system memory, depending on
your configuration.

Best regards,
-agentzh

Fry-kun commented Oct 29, 2012

  • pstack returns nothing (do I need to recompile without stripping debug info?)
  • strace has a lot of output, will send email if you don't mind
  • will try without upstream module
  • yes, lots of traffic on the server

Definitely not out of system RAM:
Mem: 132017596k total, 81299744k used, 50717852k free, 1000488k buffers
Swap: 8191996k total, 0k used, 8191996k free, 46979548k cached

Owner

agentzh commented Oct 29, 2012

Hello!

On Mon, Oct 29, 2012 at 2:58 PM, Fry-kun notifications@github.com wrote:

  • pstack returns nothing (do I need to recompile without stripping
    debug info?)

Yes, you should not strip the debug symbols. And that's also why Nginx's
build system turns on -g by default :)

  • strace has a lot of output, will send email if you don't mind

Yes, please :) Don't send it all; just a few dozen lines (no more than 100)
should be fine.

BTW, do your Nginx worker processes spin at 100% forever, or just for a
short while?

  • will try without upstream module

I mean the upstream hash module, not all the upstream modules :)

  • yes, lots of traffic on the server

Then I suggest you add proper failover logic to your Lua code.
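
For example, here's a minimal sketch of such failover around the subrequest from your sitemetrics.lua above (not your actual code, of course -- the pcall wrapper assumes LuaJIT, where pcall can resume across the capture's yield, and hex_ip, qry, and redis_parser_build_query come from your earlier snippet):

    -- wrap the capture in pcall so a failed subrequest degrades
    -- gracefully instead of aborting the whole Lua handler
    local ok, res = pcall(ngx.location.capture,
        "/redis_metrics_internal?key=" .. hex_ip,
        { method = ngx.HTTP_PUT, body = redis_parser_build_query(qry) })

    if not ok or not res or res.status >= 500 then
        -- redis is down: log once and serve the page without metrics
        ngx.log(ngx.WARN, "metrics subrequest failed, skipping")
        return
    end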

Also, the lua-resty-redis library is more efficient than doing subrequests to
ngx_redis2 (both CPU-wise and memory-wise):

https://github.com/agentzh/lua-resty-redis

It's also more flexible and feature-complete. You can do redis backend
sharding and failover all in Lua yourself :)
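
For instance, a rough sketch of sharding plus failover with lua-resty-redis (the shard table mirrors your upstream block above; script is your EVAL script, and keys_and_args is just a placeholder for your 6 keys plus arguments):

    local redis = require "resty.redis"

    local script = "local u=require('redis_utils'); return u.sitemetrics(KEYS,ARGV)"
    local shards = { { "gw513", 6391 }, { "gw514", 6391 } }

    -- hash the key to pick a shard, then fail over to the next one
    local idx = ngx.crc32_short(hex_ip) % #shards

    for i = 0, #shards - 1 do
        local shard = shards[(idx + i) % #shards + 1]
        local red = redis:new()
        red:set_timeout(10)  -- in ms, roughly the redis2_* timeouts above
        local ok, err = red:connect(shard[1], shard[2])
        if ok then
            local res
            res, err = red:eval(script, 6, unpack(keys_and_args))
            if res then
                red:set_keepalive(10000, 1024)  -- mirrors "keepalive 1024"
                return res
            end
        end
        ngx.log(ngx.WARN, "redis shard ", shard[1], " failed: ", err)
    end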

Definitely not out of system RAM:
Mem: 132017596k total, 81299744k used, 50717852k free, 1000488k buffers
Swap: 8191996k total, 0k used, 8191996k free, 46979548k cached

Then are you seeing the following error message in your error.log?

subrequests cycle while processing "..."...

Best regards,
-agentzh

Owner

agentzh commented Oct 29, 2012

Hello!

On Mon, Oct 29, 2012 at 3:09 PM, agentzh agentzh@gmail.com wrote:

On Mon, Oct 29, 2012 at 2:58 PM, Fry-kun notifications@github.com wrote:

pstack returns nothing (do I need to recompile without stripping debug
info?)

Yes, you should not strip the debug symbols. And that's also why Nginx's
build system turns on -g by default :)

Also, please run pstack many times (10 or more) and collect
all the outputs while your worker is spinning :) (Yes, you're
essentially doing manual stack-trace sampling :))

Best regards,
-agentzh

Fry-kun commented Oct 29, 2012

Yes, nginx workers stay at 100% forever (until the redis server is restored).
You're right, there's actually a message saying "subrequests cycle"! That must be it!

Also, I didn't realize lua-resty-redis is more efficient/etc. -- I'll probably switch the config over soon :)

Fry-kun commented Oct 29, 2012

Unfortunately, setting redis2_next_upstream off didn't help -- still getting "subrequests cycle"... and it still takes ~140 tries to get there... that's weird.

Owner

agentzh commented Oct 29, 2012

Hello!

On Mon, Oct 29, 2012 at 3:49 PM, Fry-kun notifications@github.com wrote:

Unfortunately, setting redis2_next_upstream off didn't help -- still
getting "subrequests cycle"..

redis2_next_upstream has nothing to do with the subrequest cycling error.

Maybe your Lua code can issue many subrequests in certain code paths (like
when operations fail)? Can you reproduce it with a minimized configuration
and Lua code snippet on your development box?

Also, waiting for your pstack sampling results :)

Best regards,
-agentzh

Fry-kun commented Oct 30, 2012

Found the problem:
error_page 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 500 501 502 503 504 505 $request_uri; # hack to use nginx internal error pages

Stupid config that seemed to work with other pages for some reason -- but with these requests, nginx kept banging on the redis2 subrequest (~140 times). Presumably error_page turned every failed subrequest into an internal redirect back to $request_uri, which re-ran the handler and issued yet another redis2 subrequest until nginx's subrequest-cycle guard kicked in.

Got rid of this garbage and everything is fine now

P.S. pstack never returned anything, though gdb showed a stack trace without a problem...

Owner

agentzh commented Oct 30, 2012

Hello!

On Mon, Oct 29, 2012 at 7:04 PM, Fry-kun notifications@github.com wrote:

Found the problem:
error_page 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 500 501 502 503 504 505 $request_uri; # hack to use nginx internal error pages

Stupid config that seemed to work with other pages for some reason -- but with these requests, nginx kept banging on the redis2 subrequest (~140 times)

Ah...okay...

Got rid of this garbage and everything is fine now

Cool :)

P.S. pstack never returned anything, though gdb showed a stack trace without a problem...

Interesting. pstack essentially invokes gdb behind the scenes :)

Best regards,
-agentzh

Owner

agentzh commented Oct 30, 2012

I'm closing this :)

@agentzh agentzh closed this Oct 30, 2012
