
Improve cluster load balancing #3241

Closed
piscisaureus opened this Issue May 09, 2012 · 33 comments

6 participants: Bert Belder, Ben Noordhuis, Fedor Indutny, darcy, brettkiefer, svenhenderson
Bert Belder
Collaborator

Needs some work in libuv.

@bnoordhuis, if you want

Ben Noordhuis

Sure, I'll work on it this week.

Fedor Indutny
Collaborator
indutny commented May 09, 2012

Just curious, what is the idea behind this?

Ben Noordhuis

A worker keeps accepting connections until the accept queue is empty. The process scheduler on Solaris and Linux (and maybe others) often gives a slight preference to one process, which means that most new connections end up in a single worker process.

The proposed change is to accept a connection, start an idle watcher and not accept a new connection until the idle callback runs. That should guarantee that connections are more evenly distributed.
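
Roughly, in plain POSIX terms, the idea looks like the sketch below. This is only an illustration of the accept-one-then-back-off pattern, not the actual libuv change: the real version would use an idle watcher inside the event loop rather than a sleep, and the listening socket here is simply shared via fork() the way cluster shares it. Port number and worker count are arbitrary.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Each worker inherits listen_fd from the master (as cluster workers do). */
static void worker_loop(int listen_fd) {
  struct timespec backoff = { 0, 1000000 };  /* ~1 ms; stands in for "one idle tick" */

  for (;;) {
    int conn = accept(listen_fd, NULL, NULL);
    if (conn == -1)
      continue;

    /* ... do whatever work this connection needs ... */
    close(conn);

    /* Don't drain the accept queue in a tight loop; back off briefly so
     * the kernel gets a chance to wake up a sibling worker instead. */
    nanosleep(&backoff, NULL);
  }
}

int main(void) {
  struct sockaddr_in addr;
  int i, listen_fd = socket(AF_INET, SOCK_STREAM, 0);

  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(8000);               /* arbitrary example port */
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr));
  listen(listen_fd, 511);

  for (i = 0; i < 4; i++)                    /* pretend there are 4 workers */
    if (fork() == 0)
      worker_loop(listen_fd);

  pause();                                   /* master just sits there */
  return 0;
}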

Fedor Indutny
Collaborator
indutny commented May 09, 2012

Also, this may hit the backlog limit (which is 128 on OS X).

Ben Noordhuis

The backlog is configurable in master.

Fedor Indutny
Collaborator
indutny commented May 09, 2012

Not on OSX:

BUGS
     The backlog is currently limited (silently) to 128.
Ben Noordhuis

Noted, but it's probably not an issue. I kind of doubt that there are many people running high traffic production servers on OS X.

Fedor Indutny
Collaborator
indutny commented May 09, 2012

Well, but many people will run ab -n ... -c ... http://localhost:.../ on their dev machines... I expect ECONNREFUSED issues to be opened after this feature lands.

Bert Belder
Collaborator

@indutny That will probably only bite when your server is actually overloaded (since the server will start accepting again as soon as it is idle for a very small time unit).

The problem was more that if you have a server that does some I/O for every connection, it has to wait for some file descriptor before it can finish all the work related to that connection. However, since the server fd always stays in the epoll/kqueue/event port file descriptor set, this connection-related I/O can be starved by new incoming connections. The problem is aggravated by the fact that schedulers try to avoid context switches. This results in very high response latency and very bad load balancing.
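
To make the starvation mechanism concrete, here is a hypothetical epoll-level skeleton (not libuv source) of the kind of fix this implies: once a connection has been accepted, the listening fd is taken out of the interest set until that connection's I/O has been serviced, so new connections cannot crowd it out. For simplicity the sketch handles one connection at a time.

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* listen_fd is assumed to already be registered in epfd with EPOLLIN. */
void event_loop(int epfd, int listen_fd) {
  struct epoll_event ev, events[MAX_EVENTS];

  for (;;) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);

    for (int i = 0; i < n; i++) {
      if (events[i].data.fd == listen_fd) {
        int conn = accept(listen_fd, NULL, NULL);
        if (conn == -1)
          continue;

        /* Pause the listener so its readiness can't starve the
         * connection we just accepted. */
        epoll_ctl(epfd, EPOLL_CTL_DEL, listen_fd, NULL);

        /* Watch the new connection instead. */
        ev.events = EPOLLIN;
        ev.data.fd = conn;
        epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
      } else {
        /* ... service this connection's reads/writes; when done: ... */
        close(events[i].data.fd);  /* closing also drops it from epfd */

        /* The connection's work is finished; resume accepting. */
        ev.events = EPOLLIN;
        ev.data.fd = listen_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
      }
    }
  }
}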

Ben Noordhuis

New commit: bnoordhuis/libuv@49e596a

Same core concept, slightly different approach, this time against libuv master.

darcy

Great! Will this be released in v0.8?

Ben Noordhuis

No. It changes the ABI.

Ben Noordhuis bnoordhuis closed this October 12, 2012
Ben Noordhuis

Closing. I landed a better approach in joyent/libuv@c666b63 and it's available in node master now.

brettkiefer

@bnoordhuis We seem to be having a similar problem with Cluster, with some processes getting far more connections than others. We boiled the problem down by using Siege to hammer a Node.js server with simple HTTPS GETs, running Cluster with 12 worker child processes on a 12-core machine with hyperthreading (so 24 logical cores). We got more or less identical results with 0.8.8 and 0.9.3 (which, as I understand it, has your fix bnoordhuis/libuv@be2a217 for this).

The problem doesn't look bad at all on Debian Squeeze on bare metal with the 2.6.32 kernel, where we get something like this (left side is requests processed, right side is PID):
Node 0.9.3 on bare metal with 2.6.32 kernel.

reqs pid
146  2818
139  2820
211  2821
306  2823
129  2825
166  2827
138  2829
134  2831
227  2833
134  2835
129  2837
138  2838

But with Debian Squeeze on bare metal (same server) with the 3.2.0 kernel, we get

reqs pid
99   3207
186  3209
42   3210
131  3212
34   3214
53   3216
39   3218
54   3220
33   3222
931  3224
345  3226
312  3228

Does this sound to you like the same issue or something different? We understand that this behaviour may be more efficient in terms of request throughput for normal HTTPS requests, but in production we're seeing an even more pronounced pileup of WebSocket connections, which is more worrisome because of their long-lived nature and because they can go active in large numbers at any time.

Thanks!

Fedor Indutny
Collaborator

@brettkiefer can you please give tlsnappy a try? If it fits your needs it might give you much better performance. You can also start it in hybrid mode, e.g. 6 threads x 4 processes, or 4 threads x 6 processes.

Please ask me if you need any details.

brettkiefer

@indutny We probably could, but it seems like we're just dealing with a bug here, not an issue that would make us abandon Cluster; is there a reason you think tlsnappy would be a substantial improvement in balancing in the long term?

brettkiefer

@bnoordhuis But with Debian Squeeze on bare metal (same server as previous comment) and a build from master, we get similar problems.

UV_TCP_SINGLE_ACCEPT=2 with node 0.9.4-pre and 3.2.0 kernel:

reqs pid
  22 8544
  15 8546
  39 8547
  17 8549
  18 8551
1302 8553
 123 8555
  76 8557
  45 8559
 308 8561
  30 8563
 268 8565

UV_TCP_SINGLE_ACCEPT=0 with node 0.9.4-pre and 3.2.0 kernel:

reqs pid
  24 8803
  26 8805
  43 8806
 293 8808
  24 8810
 138 8812
 289 8814
  51 8816
  16 8818
  27 8820
  74 8822
1315 8824
brettkiefer

@bnoordhuis May I ask which OS and kernel version you're testing with?

Ben Noordhuis

@brettkiefer I run Ubuntu with an x86_64 3.7-rc7 kernel built from source.

Can you clone joyent/libuv, build it with make test/run-benchmarks CFLAGS=-O2, and run the following scriptlet a few times?

for TEST in tcp_multi_accept{2,4,8}; do
  for BACKOFF in 0 1; do
    env UV_TCP_SINGLE_ACCEPT=$BACKOFF test/run-benchmarks $TEST
  done
done

I'd be interested in the results for both kernels. You may have to raise ulimit -n a little.

It's possible that the tcp_multi_accept2 test fails for you - I've been meaning to fix that but haven't gotten around to it yet (that's also why I hadn't replied so far; sorry about that).

svenhenderson

Hi @bnoordhuis, I'm working with @brettkiefer to help track down this problem. With ulimit -n raised to 1000000 on both the 3.2.0 and 2.6.32 kernels, we were not able to run the tests on the most recent master checkout. I was able to track down the breakage and have included stats from the last working commit in case that is helpful. In both cases where it broke, it was receiving an EADDRNOTAVAIL error here (https://github.com/joyent/libuv/blob/master/src/unix/tcp.c#L113).

2.6.32 kernel at commit joyent/libuv@b74b1c4 (last working commit for 3.2.0)

accept2: 49781 accepts/sec (250000 total)
  thread #0: 2824 accepts/sec (14183 total, 5.7%)
  thread #1: 46957 accepts/sec (235817 total, 94.3%)
accept2: 50375 accepts/sec (250000 total)
  thread #0: 2957 accepts/sec (14674 total, 5.9%)
  thread #1: 47418 accepts/sec (235326 total, 94.1%)
accept4: 44230 accepts/sec (250000 total)
  thread #0: 320 accepts/sec (1809 total, 0.7%)
  thread #1: 2402 accepts/sec (13578 total, 5.4%)
  thread #2: 394 accepts/sec (2227 total, 0.9%)
  thread #3: 41114 accepts/sec (232386 total, 93.0%)
accept4: 44195 accepts/sec (250000 total)
  thread #0: 592 accepts/sec (3346 total, 1.3%)
  thread #1: 41012 accepts/sec (231997 total, 92.8%)
  thread #2: 963 accepts/sec (5447 total, 2.2%)
  thread #3: 1628 accepts/sec (9210 total, 3.7%)
accept8: 51616 accepts/sec (250000 total)
  thread #0: 7494 accepts/sec (36297 total, 14.5%)
  thread #1: 5610 accepts/sec (27173 total, 10.9%)
  thread #2: 5724 accepts/sec (27722 total, 11.1%)
  thread #3: 6014 accepts/sec (29128 total, 11.7%)
  thread #4: 5712 accepts/sec (27664 total, 11.1%)
  thread #5: 5939 accepts/sec (28767 total, 11.5%)
  thread #6: 9344 accepts/sec (45255 total, 18.1%)
  thread #7: 5780 accepts/sec (27994 total, 11.2%)
accept8: 46275 accepts/sec (250000 total)
  thread #0: 16152 accepts/sec (87261 total, 34.9%)
  thread #1: 4334 accepts/sec (23412 total, 9.4%)
  thread #2: 4290 accepts/sec (23177 total, 9.3%)
  thread #3: 4747 accepts/sec (25647 total, 10.3%)
  thread #4: 4160 accepts/sec (22476 total, 9.0%)
  thread #5: 4046 accepts/sec (21857 total, 8.7%)
  thread #6: 4350 accepts/sec (23502 total, 9.4%)
  thread #7: 4196 accepts/sec (22668 total, 9.1%)

2.6.32 kernel at last working commit joyent/libuv@a8c6da8

accept2: 51802 accepts/sec (250000 total)
  thread #0: 8615 accepts/sec (41577 total, 16.6%)
  thread #1: 43187 accepts/sec (208423 total, 83.4%)
accept2: 49303 accepts/sec (250000 total)
  thread #0: 4947 accepts/sec (25087 total, 10.0%)
  thread #1: 44356 accepts/sec (224913 total, 90.0%)
accept4: 40615 accepts/sec (250000 total)
  thread #0: 1364 accepts/sec (8399 total, 3.4%)
  thread #1: 35089 accepts/sec (215989 total, 86.4%)
  thread #2: 2585 accepts/sec (15910 total, 6.4%)
  thread #3: 1576 accepts/sec (9702 total, 3.9%)
accept4: 39169 accepts/sec (250000 total)
  thread #0: 25151 accepts/sec (160525 total, 64.2%)
  thread #1: 3800 accepts/sec (24253 total, 9.7%)
  thread #2: 3400 accepts/sec (21702 total, 8.7%)
  thread #3: 6819 accepts/sec (43520 total, 17.4%)
accept8: 51362 accepts/sec (250000 total)
  thread #0: 5269 accepts/sec (25647 total, 10.3%)
  thread #1: 6028 accepts/sec (29339 total, 11.7%)
  thread #2: 10688 accepts/sec (52023 total, 20.8%)
  thread #3: 6388 accepts/sec (31092 total, 12.4%)
  thread #4: 5656 accepts/sec (27528 total, 11.0%)
  thread #5: 5900 accepts/sec (28718 total, 11.5%)
  thread #6: 5923 accepts/sec (28828 total, 11.5%)
  thread #7: 5511 accepts/sec (26825 total, 10.7%)
accept8: 50543 accepts/sec (250000 total)
  thread #0: 5079 accepts/sec (25123 total, 10.0%)
  thread #1: 5774 accepts/sec (28561 total, 11.4%)
  thread #2: 6438 accepts/sec (31843 total, 12.7%)
  thread #3: 9312 accepts/sec (46062 total, 18.4%)
  thread #4: 6704 accepts/sec (33160 total, 13.3%)
  thread #5: 5747 accepts/sec (28426 total, 11.4%)
  thread #6: 5992 accepts/sec (29638 total, 11.9%)
  thread #7: 5496 accepts/sec (27187 total, 10.9%)

2.6.32 kernel at next commit joyent/libuv@29f44c7 fails every run with this error:

Assertion failed in test/benchmark-multi-accept.c on line 384: 0 == uv_tcp_connect(&ctx->connect_req, handle, listen_addr, cl_connect_cb)

3.2.0 kernel at last working commit joyent/libuv@b74b1c4

accept2: 47875 accepts/sec (250000 total)
  thread #0: 19206 accepts/sec (100293 total, 40.1%)
  thread #1: 28669 accepts/sec (149707 total, 59.9%)
accept2: 49969 accepts/sec (250000 total)
  thread #0: 7657 accepts/sec (38310 total, 15.3%)
  thread #1: 42312 accepts/sec (211690 total, 84.7%)
accept4: 45407 accepts/sec (250000 total)
  thread #0: 1047 accepts/sec (5767 total, 2.3%)
  thread #1: 20332 accepts/sec (111941 total, 44.8%)
  thread #2: 839 accepts/sec (4621 total, 1.8%)
  thread #3: 23189 accepts/sec (127671 total, 51.1%)
accept4: 44066 accepts/sec (250000 total)
  thread #0: 3430 accepts/sec (19462 total, 7.8%)
  thread #1: 14976 accepts/sec (84964 total, 34.0%)
  thread #2: 3798 accepts/sec (21546 total, 8.6%)
  thread #3: 21862 accepts/sec (124028 total, 49.6%)
accept8: 41855 accepts/sec (250000 total)
  thread #0: 1687 accepts/sec (10076 total, 4.0%)
  thread #1: 1671 accepts/sec (9982 total, 4.0%)
  thread #2: 1522 accepts/sec (9093 total, 3.6%)
  thread #3: 5971 accepts/sec (35667 total, 14.3%)
  thread #4: 1719 accepts/sec (10270 total, 4.1%)
  thread #5: 25322 accepts/sec (151245 total, 60.5%)
  thread #6: 1718 accepts/sec (10260 total, 4.1%)
  thread #7: 2245 accepts/sec (13407 total, 5.4%)
accept8: 42491 accepts/sec (250000 total)
  thread #0: 2703 accepts/sec (15901 total, 6.4%)
  thread #1: 3260 accepts/sec (19181 total, 7.7%)
  thread #2: 2954 accepts/sec (17380 total, 7.0%)
  thread #3: 3217 accepts/sec (18930 total, 7.6%)
  thread #4: 3948 accepts/sec (23231 total, 9.3%)
  thread #5: 19288 accepts/sec (113483 total, 45.4%)
  thread #6: 3142 accepts/sec (18484 total, 7.4%)
  thread #7: 3979 accepts/sec (23410 total, 9.4%)

3.2.0 kernel at the next commit joyent/libuv@a8c6da8 fails every run with this error:

Assertion failed in test/benchmark-multi-accept.c on line 380: 0 == uv_tcp_connect(&ctx->connect_req, handle, listen_addr, cl_connect_cb)
Ben Noordhuis

@svenhenderson Thanks. It's possible that you have to raise sysctl net.ipv4.ip_local_port_range as well, the default (IIRC) is a fairly limited 32768-61000.

brettkiefer

@bnoordhuis When @svenhenderson and I set

net.ipv4.ip_local_port_range=1024 64000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

and skip the accept2 benchmark, which times out, we get better distribution with UV_TCP_SINGLE_ACCEPT=1 on the 3.2 kernel:

accept4: 35302 accepts/sec (250000 total)
  thread #0: 2277 accepts/sec (16128 total, 6.5%)
  thread #1: 6084 accepts/sec (43085 total, 17.2%)
  thread #2: 2965 accepts/sec (20996 total, 8.4%)
  thread #3: 23976 accepts/sec (169791 total, 67.9%)
accept4: 54252 accepts/sec (250000 total)
  thread #0: 12856 accepts/sec (59240 total, 23.7%)
  thread #1: 14243 accepts/sec (65634 total, 26.3%)
  thread #2: 12878 accepts/sec (59345 total, 23.7%)
  thread #3: 14275 accepts/sec (65781 total, 26.3%)
accept8: 40958 accepts/sec (250000 total)
  thread #0: 1874 accepts/sec (11441 total, 4.6%)
  thread #1: 2497 accepts/sec (15244 total, 6.1%)
  thread #2: 2054 accepts/sec (12538 total, 5.0%)
  thread #3: 3147 accepts/sec (19209 total, 7.7%)
  thread #4: 2857 accepts/sec (17441 total, 7.0%)
  thread #5: 5428 accepts/sec (33129 total, 13.3%)
  thread #6: 3418 accepts/sec (20862 total, 8.3%)
  thread #7: 19682 accepts/sec (120136 total, 48.1%)
accept8: 41335 accepts/sec (250000 total)
  thread #0: 1185 accepts/sec (7167 total, 2.9%)
  thread #1: 1344 accepts/sec (8128 total, 3.3%)
  thread #2: 6758 accepts/sec (40872 total, 16.3%)
  thread #3: 8051 accepts/sec (48694 total, 19.5%)
  thread #4: 1640 accepts/sec (9917 total, 4.0%)
  thread #5: 8921 accepts/sec (53959 total, 21.6%)
  thread #6: 3456 accepts/sec (20902 total, 8.4%)
  thread #7: 9980 accepts/sec (60361 total, 24.1%)

We're trying to figure out why we don't see a similar improvement on our Siege benchmark. Sven suggests that the nanosleep, while effective with a huge number of accepts/sec, doesn't make a big difference when you're at lower rates (like our 20/sec).

brettkiefer

. . . and on the 2.6.32 kernel, where we do not see a pronounced imbalance in our http Siege tests, UV_TCP_SINGLE_ACCEPT=1 seems to slow things way down:

accept4: 48405 accepts/sec (250000 total)
  thread #0: 2534 accepts/sec (13086 total, 5.2%)
  thread #1: 27567 accepts/sec (142376 total, 57.0%)
  thread #2: 3407 accepts/sec (17596 total, 7.0%)
  thread #3: 14898 accepts/sec (76942 total, 30.8%)
accept4: 4961 accepts/sec (250000 total)
  thread #0: 1167 accepts/sec (58828 total, 23.5%)
  thread #1: 1298 accepts/sec (65401 total, 26.2%)
  thread #2: 1285 accepts/sec (64733 total, 25.9%)
  thread #3: 1211 accepts/sec (61038 total, 24.4%)
accept8: 54621 accepts/sec (250000 total)
  thread #0: 6399 accepts/sec (29288 total, 11.7%)
  thread #1: 6571 accepts/sec (30076 total, 12.0%)
  thread #2: 9538 accepts/sec (43655 total, 17.5%)
  thread #3: 7417 accepts/sec (33950 total, 13.6%)
  thread #4: 6034 accepts/sec (27617 total, 11.0%)
  thread #5: 7068 accepts/sec (32350 total, 12.9%)
  thread #6: 5323 accepts/sec (24362 total, 9.7%)
  thread #7: 6271 accepts/sec (28702 total, 11.5%)
accept8: 18142 accepts/sec (250000 total)
  thread #0: 1928 accepts/sec (26563 total, 10.6%)
  thread #1: 2599 accepts/sec (35809 total, 14.3%)
  thread #2: 1707 accepts/sec (23520 total, 9.4%)
  thread #3: 2587 accepts/sec (35655 total, 14.3%)
  thread #4: 1836 accepts/sec (25307 total, 10.1%)
  thread #5: 3032 accepts/sec (41788 total, 16.7%)
  thread #6: 1675 accepts/sec (23085 total, 9.2%)
  thread #7: 2777 accepts/sec (38273 total, 15.3%)
Ben Noordhuis

Sven suggests that the nanosleep, while effective with a huge number of accepts/sec, doesn't make a big difference when you're at lower rates (like our 20/sec).

That is (or should be) correct.

UV_TCP_SINGLE_ACCEPT=1 seems to slow things way down:

That's true, unfortunately. Some workloads seem to trigger a weak spot in the kernel's scheduler and network stack. joyent/libuv#624 has more details if you're interested.

@piscisaureus and I have been discussing it for a while now. The problem is that it's hard, bordering on impossible, to reliably figure out if there is only one process listening on a particular socket. I'm talking about the general case here; we can optimize for some scenarios.

For the moment, we agreed that reduced raw accept() performance is an acceptable trade-off when it means that connections get more evenly distributed over listening processes.

But if you're experiencing major performance regressions, please let me know. You can join us on IRC if you want, we hang out in #libuv on freenode.org.

brettkiefer

So if it's true that this fix should have no effect at lower accept loads, then we actually wouldn't expect it to make any difference in the distribution of requests over listening processes in our case, that of a bunch of stateless web heads accepting only hundreds of connections per second. Does that sum it up?

brettkiefer

I added a gist with a simple cluster test file that illustrates the problem I'm seeing as simply as possible:

https://gist.github.com/4270418

To repro (or not, depending on your kernel):

  1. save https://gist.github.com/4270418 as test.js
  2. run with "./node test.js > pid.log"
  3. from another shell do "siege -t1m -c 50 http://localhost:8000"
  4. wait for the siege to finish
  5. ctrl-c the node process
  6. sort pid.log | uniq -c

With node at 0506d29 and libuv at 731adac, and UV_TCP_SINGLE_ACCEPT=1, balancing indeed seems to be no better than with node 0.8.8 or with UV_TCP_SINGLE_ACCEPT=0 -- they all come out pretty similar. Here is the run with the most up-to-date build, on a 12-core machine with 24 logical cores:

   reqs pid
    130 27825
    176 27827
    197 27828
    239 27829
    284 27831
    332 27833
    416 27835
    447 27837
    613 27839
    674 27841
    816 27843
   1305 27845

@bnoordhuis Does your setup balance that benchmark substantially better?

Ben Noordhuis

Does your setup balance that benchmark substantially better?

Yes. For the record, I'm running 0506d29 (current HEAD) on Linux 3.7-rc7 on an 8-core i7 @ 3.4 GHz (4 actual cores, 2 hyper-threads per core).

With a non-tweaked siege:

    387 12272
    449 12273
    457 12275
    496 12276

siege -c 50:

   1464 13263
   1227 13264
   1533 13266
   1411 13267

siege -b -c 50:

 249561 13169
 248308 13170
 250179 13172
 250679 13173

siege -b -c 500:

 125884 12358
 128354 12359
 128337 12361
 130843 12362

ab -c 500 -n 500000:

 106411 12320
 108085 12321
 109339 12323
 112365 12324
brettkiefer

Thanks! We're seeing better results on the 3.7 kernel (like a 3-to-1 rather than a 100-to-1 imbalance), so we're looking into kernel differences and we'll post here when we know more.

brettkiefer

The 3.6.10 kernel did far better on bare metal than the 3.2 kernel, but when we tried it on AWS cc2.8xlarge instances we were back to the same problem of 100x imbalances between processes. I now think we're seeing this: http://serverfault.com/questions/272483/why-is-tcp-accept-performance-so-bad-under-xen -- because we're fine on bare metal (on kernels other than 3.2) but completely imbalanced on HVM.

We're going to see how we do on a Paravirtual AMI, but it's looking more and more like one just cannot count on the OS to distribute load appropriately between processes accepting on a shared socket, and we'll just proxy locally with HAProxy or Node.

Ben Noordhuis

FWIW, I've disabled the accept() back-off patch in 7b2ef2d (joyent/libuv@dc559a5). It's causing performance drops > 50% on a number of benchmarks.

Ben Noordhuis

@brettkiefer FYI, we may switch to round-robin, see #4435 for details. The issue now is time and manpower.
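
For context, "round-robin" here means the master accepts every connection itself and hands each one to the next worker in turn, instead of letting workers race on a shared listening socket. The sketch below is a generic illustration of one way such a hand-off can work on Unix (passing the accepted fd over a socketpair with SCM_RIGHTS); it is an assumption about the general shape of the approach, not node's or #4435's actual implementation.

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Pass one file descriptor over a connected Unix-domain socket. */
static int send_fd(int chan, int fd) {
  char dummy = '!';
  char ctrl[CMSG_SPACE(sizeof(int))];
  struct iovec iov = { &dummy, 1 };
  struct msghdr msg;
  struct cmsghdr *cmsg;

  memset(&msg, 0, sizeof(msg));
  memset(ctrl, 0, sizeof(ctrl));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl;
  msg.msg_controllen = sizeof(ctrl);

  cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

  return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Master loop: chan[i] is the master's end of worker i's socketpair. */
void dispatch(int listen_fd, int *chan, int nworkers) {
  int next = 0;

  for (;;) {
    int conn = accept(listen_fd, NULL, NULL);
    if (conn == -1)
      continue;

    send_fd(chan[next], conn);     /* the worker handles the request */
    close(conn);                   /* master's copy is no longer needed */

    next = (next + 1) % nworkers;  /* even distribution by construction */
  }
}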

brettkiefer

@bnoordhuis Neat! That'd be great for us, anyway. Right now we're testing different HAProxy configurations for best performance and failover, but we'd love to go to Cluster round-robin. Thanks!
