nsqd: high CPU when idle w/ lots (thousands) of topics/consumers #236

Merged: 1 commit merged into nsqio:master from mreiferson/high_cpu_236 on Sep 3, 2013

@mreiferson (Member) commented Sep 2, 2013

Hi,

I ran a simple test that created thousands of topics with a single consumer each, which led to nsqd bombing with the following error:

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fff8cd2f212

and then a stack trace for, I guess, every running goroutine.

goroutine 1 [chan receive]:
main.main()
    /var/folders/4q/k4wfn18j7v79zj7l2tqr_5p80000gn/T/nsqgopath.TQxJw0U2/src/github.com/bitly/nsq/nsqd/main.go:145 +0xde6

goroutine 2 [syscall]:

goroutine 4 [syscall]:
os/signal.loop()
    /usr/local/Cellar/go/1.1.1/src/pkg/os/signal/signal_unix.go:21 +0x1c
created by os/signal.init·1
    /usr/local/Cellar/go/1.1.1/src/pkg/os/signal/signal_unix.go:27 +0x2f

goroutine 6 [chan receive]:
main.func·009()
    /var/folders/4q/k4wfn18j7v79zj7l2tqr_5p80000gn/T/nsqgopath.TQxJw0U2/src/github.com/bitly/nsq/nsqd/main.go:102 +0x39
created by main.main
    /var/folders/4q/k4wfn18j7v79zj7l2tqr_5p80000gn/T/nsqgopath.TQxJw0U2/src/github.com/bitly/nsq/nsqd/main.go:104 +0x697

goroutine 7 [select]:
main.(*NSQd).idPump(0xc20013a000)
    /var/folders/4q/k4wfn18j7v79zj7l2tqr_5p80000gn/T/nsqgopath.TQxJw0U2/src/github.com/bitly/nsq/nsqd/nsqd.go:353 +0x2bd
main.func·010()
    /var/folders/4q/k4wfn18j7v79zj7l2tqr_5p80000gn/T/nsqgopath.TQxJw0U2/src/github.com/bitly/nsq/nsqd/nsqd.go:92 +0x27
github.com/bitly/nsq/util.func·001()
    /var/folders/4q/k4wfn18j7v79zj7l2tqr_5p80000gn/T/nsqgopath.TQxJw0U2/src/github.com/bitly/nsq/util/wait_group_wrapper.go:14 +0x2c
created by github.com/bitly/nsq/util.(*WaitGroupWrapper).Wrap
    /var/folders/4q/k4wfn18j7v79zj7l2tqr_5p80000gn/T/nsqgopath.TQxJw0U2/src/github.com/bitly/nsq/util/wait_group_wrapper.go:16 +0xa6

[ ... etc ... ]

Here's the Ruby program that triggered it:

require 'nsq'
reader = NSQ::Reader.new(:nsqd_tcp_addresses => '127.0.0.1:4150')
5000.times do |i|
  reader.subscribe("foo-#{i}", "bar") do
  end
end
reader.run

This may not be a valid use case for nsq, but perhaps some throttling of topics may be necessary.

@michaelhood (Contributor) commented Jul 29, 2013

pthread_create failed: Resource temporarily unavailable

That failure is environment specific: you've exhausted RLIMIT_NPROC (or its equivalent), so pthread_create is returning EAGAIN.

It looks like you're on OS X, so check out `man 2 setrlimit` and `ulimit -u`.

@mreiferson (Member) commented Jul 30, 2013

Managing limits on OS X is a complete shit show, so I did some testing in a Linux VM and was able to scale up to 4k connections, each w/ a topic/channel, at ~860MB resident and 40% CPU.

So I think @michaelhood is right on the money; the crash you're seeing is just limits in OS X...

@chetan commented Jul 30, 2013

Ok, that makes sense. Maybe some warnings could be printed if `ulimit -n`/`-u` is too low? I believe nginx does this (among others).

@mreiferson (Member) commented Jul 30, 2013

I did some profiling of nsqd while running with that many topics/channels/clients. It seems the bulk of the CPU time is spent in the runtime managing the heap of timers that are used in various places.

There is probably room for improvement with these code paths.

@chetan I'm not terribly keen on writing cross-platform resource limit checks, but if you want to take a pass I will promise to review it :)

I'm going to close this specific issue since there's nothing to actually "fix". I will open follow-up issues when I better understand where nsqd is specifically spending its time at this scale.
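
For illustration only, here is a toy standalone Go program (not nsqd code) that mimics the situation described in this profile: many idle clients, each armed with a short timer, keep the runtime's timer heap busy even though no messages flow. The names and counts are invented for the sketch:

package main

import (
	"runtime"
	"time"
)

// idleClient resets a short-interval timer in a loop, roughly mimicking an
// idle client with a 5ms output buffer timeout: nothing is flushed, but the
// runtime must keep sifting its timer heap on every expiry.
func idleClient(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTimer(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			t.Reset(interval) // nothing to do, but the timer churns anyway
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	for i := 0; i < 500; i++ {
		go idleClient(5*time.Millisecond, stop)
	}
	time.Sleep(5 * time.Second) // observe CPU usage of an otherwise idle process
	close(stop)
	println("goroutines:", runtime.NumGoroutine())
}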

@mreiferson closed this Jul 30, 2013

@mreiferson (Member) commented Jul 30, 2013

[screenshot attached: "screen shot 2013-07-30 at 9 42 47 am"]

@chetan commented Jul 30, 2013

Fair enough. A documentation note should suffice, perhaps mentioning that the normal use case is to have < 100 topics/channels.

@mreiferson (Member) commented Jul 30, 2013

Yea, that's a good suggestion. That wasn't the intention, though, so it looks like we have some work to do to be able to support those specific types of use cases.

@mreiferson (Member) commented Jul 30, 2013

Actually, gonna re-open (updated the title).

@mreiferson reopened this Jul 30, 2013

@mreiferson (Member) commented Aug 13, 2013

It's possible that http://golang.org/pkg/sync/#Cond might be useful in re-architecting the proliferation of timers.
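
For illustration, a minimal Go sketch of the idea (not nsqd code): a single sync.Cond lets consumers block until work arrives instead of each polling its own timer. The queue type here is invented for the example:

package main

import (
	"fmt"
	"sync"
)

// queue is a toy work queue: consumers block on a sync.Cond instead of
// each maintaining their own timer to poll for new items.
type queue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []string
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Push appends an item and wakes one waiting consumer.
func (q *queue) Push(item string) {
	q.mu.Lock()
	q.items = append(q.items, item)
	q.mu.Unlock()
	q.cond.Signal()
}

// Pop blocks until an item is available, with no timer churn while idle.
func (q *queue) Pop() string {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait()
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item
}

func main() {
	q := newQueue()
	done := make(chan struct{})
	go func() {
		fmt.Println(q.Pop()) // blocks until Push
		close(done)
	}()
	q.Push("hello")
	<-done
}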

@mreiferson (Member) commented Sep 2, 2013

Please see the book I wrote in the commit message for the attached commit.

cc @jehiah, @michaelhood, and @jayridge; would love to hear your thoughts...

nsqd/client_v2.go:

@@ -281,7 +281,7 @@ func (c *ClientV2) SetOutputBufferTimeout(desiredTimeout int) error {
 		timeout = -1
 	case desiredTimeout == 0:
 		// do nothing (use default)
-	case desiredTimeout >= 5 &&
+	case desiredTimeout >= 250 &&

@jehiah (Member) commented Sep 3, 2013

I don't think we need to change this here. It should still be possible for a client to request a 5ms buffer timeout.

This change limits the minimum to 250ms. We only want to change the default from 5ms to 250ms.

(good commit comments btw)

@mreiferson (Member) commented Sep 3, 2013

Yea, I was thinking about that. I'm not convinced that a 5ms lower bound makes sense... to your point, maybe it shouldn't be 250ms either, though.

There are three states:

  1. max_in_flight = 1 behaves essentially unbuffered regardless of volume.
  2. max_in_flight > 1 and a low-volume topic. If you want to prioritize delivery latency, the appropriate configuration is to disable output buffering (we should document this).
  3. max_in_flight > 1 and a higher-volume topic. You can either prefer throughput (at the potential cost of a maximum delivery latency that can be set as low as this value) or disable output buffering to prioritize delivery latency.

I guess my argument is that you can't really have everything (which is what I feel like this was trying to do before). I like how it's still possible to prioritize latency with this value lower-bounded at 250ms; it just means consciously choosing that tradeoff.

Ultimately, it's not the end of the world to allow it to be configured that low, it just needs to come with some disclaimers. But is there a better lower bound?

@jehiah (Member) commented Sep 3, 2013

You have a fair point about arbitrarily picking a lower bound. We should allow >= 1ms and just document the tradeoff between max latency and overhead. That should probably go in the docs about IDENTIFY options (http://bitly.github.io/nsq/clients/tcp_protocol_spec.html), but we should also have a paragraph in http://bitly.github.io/nsq/clients/building_client_libraries.html about the actual IDENTIFY options a client should support.
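
For context, here is a hedged Go sketch of the IDENTIFY body a client could send to disable output buffering, using the output_buffer_timeout field from the protocol spec linked above (where -1 disables the timeout); the struct, field subset, and values are illustrative only, not a prescribed client implementation:

package main

import (
	"encoding/json"
	"fmt"
)

// identifyBody mirrors a subset of the JSON body sent with the IDENTIFY
// command; output_buffer_timeout of -1 disables output buffering timeouts.
type identifyBody struct {
	ClientID            string `json:"client_id"`
	Hostname            string `json:"hostname"`
	FeatureNegotiation  bool   `json:"feature_negotiation"`
	OutputBufferTimeout int    `json:"output_buffer_timeout"`
}

func main() {
	body, err := json.Marshal(identifyBody{
		ClientID:            "latency-sensitive-consumer",
		Hostname:            "worker-1",
		FeatureNegotiation:  true,
		OutputBufferTimeout: -1, // prioritize delivery latency over throughput
	})
	if err != nil {
		panic(err)
	}
	// a client would send "IDENTIFY\n" followed by the 4-byte body size and the body
	fmt.Println(string(body))
}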

nsqd: increase default output buffer timeout:
This change significantly impacts CPU usage on an `nsqd`
with many clients by greatly reducing the amount of work
the Go runtime has to do to maintain its internal heap
of timers.

Previously, the default was `5ms` which meant that 200 times
per second a timer event would fire (and consequently sift
the heap).

If you multiply that by a large # of clients, say 500, you
can quickly see why this would be significant.

    200 x 500 = 100,000 heap sifts per second

Obviously, it's not just the *number* of timers but their
*resolution* that's important.

The other interesting point is that despite the `V2` protocol
`messagePump` *ignoring* the timer when it looped after a previous
flush, the Go runtime still needs to maintain the timer and attempt
to write to the timer's output channel... costing precious cycles
and context switches.

By increasing the default resolution of the timer,
we significantly reduce the work the Go runtime needs to do.

In testing this reduced CPU usage on an *idle* nsqd from 33% to 3%
with 250 clients.

What does this impact?

The client output buffer timeout actually does two things...

 1. It determines the amount of time a client will accumulate
    data in its output buffer before flushing and writing to
    the socket.

 2. It bounds the maximum amount of time a client with
    `max_in_flight > 1` will wait for additional messages
    before flushing.

The second case is the interesting one.  To elaborate, the
pathological case is that for a client subscribed to a
*low-volume* topic, with a `max_in_flight > 1`, messages
can be delivered with up to `output_buffer_timeout` latency.

The `max_in_flight > 1` is important, because `messagePump`
automatically flushes when a client is in a state where it
is not ready to receive messages (i.e. `RDY = 0`).

This means that for clients that absolutely *need* to have
`max_in_flight > 1` on *low-volume* topics *and* prioritize
latency, they should disable output buffering, which we already
support via `IDENTIFY`.
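
For illustration only (this is not nsqd's actual messagePump), here is a minimal, self-contained Go sketch of the buffer-and-flush pattern the commit message describes, using the new 250ms value as the flush interval; clientPump, the msgs channel, and the use of net.Pipe are stand-ins invented for the example, and error handling is omitted:

package main

import (
	"bufio"
	"io"
	"net"
	"time"
)

// clientPump accumulates writes in a buffered writer and flushes either when
// the output buffer timeout fires or when there is nothing left to send,
// bounding delivery latency by outputBufferTimeout.
func clientPump(conn net.Conn, msgs <-chan []byte, outputBufferTimeout time.Duration) {
	w := bufio.NewWriter(conn)
	flushTicker := time.NewTicker(outputBufferTimeout)
	defer flushTicker.Stop()

	for {
		select {
		case msg, ok := <-msgs:
			if !ok {
				w.Flush()
				return
			}
			w.Write(msg) // buffered; not yet on the wire
		case <-flushTicker.C:
			// whatever has accumulated goes out now
			w.Flush()
		}
	}
}

func main() {
	// usage sketch: an in-memory pipe stands in for a client connection
	server, client := net.Pipe()
	go io.Copy(io.Discard, server) // discard whatever the pump writes

	msgs := make(chan []byte)
	go clientPump(client, msgs, 250*time.Millisecond)
	msgs <- []byte("hello\n")
	close(msgs)
	time.Sleep(300 * time.Millisecond)
}
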
@mreiferson (Member) commented Sep 3, 2013

Amended with updated lower bound, documented in c0cac78.

jehiah added a commit that referenced this pull request Sep 3, 2013

Merge pull request #236 from mreiferson/high_cpu_236
nsqd: high CPU when idle w/ lots (thousands) of topics/consumers

@jehiah merged commit 4c9328d into nsqio:master Sep 3, 2013

1 check passed: the Travis CI build passed.

@mreiferson referenced this pull request Feb 10, 2014: topic capacity #307 (closed)
