Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: add uv threadpool congestion regression test #23099

Merged

Conversation

@gireeshpunathil
Copy link
Member

commented Sep 26, 2018

Validate that massive dns lookups do not block filesytem I/O (or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns lookup were sufficient to engage libuv threadpool workers in a blocking manner, throttling other work items that need the pool. this test acts as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
@Trott

This comment has been minimized.

Copy link
Member

commented Sep 26, 2018

This test fails without an internet connection so it probably needs to go in test/internet rather than test/sequential.

Here's what I get if I disconnect from the network:

$ tools/test.py test/sequential/test-uv-threadpool-schedule.js 
=== release test-uv-threadpool-schedule ===                    
Path: sequential/test-uv-threadpool-schedule
assert.js:349
    throw err;
    ^

AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 10, expected: 0.57
    at GetAddrInfoReqWrap.dns.lookup.common.mustCall (/Users/trott/io.js/test/sequential/test-uv-threadpool-schedule.js:45:14)
    at GetAddrInfoReqWrap.callback (/Users/trott/io.js/test/common/index.js:349:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:58:17)
Command: out/Release/node /Users/trott/io.js/test/sequential/test-uv-threadpool-schedule.js
[00:00|% 100|+   0|-   1]: Done  
$
@cjihrig

This comment has been minimized.

Copy link
Contributor

commented Sep 26, 2018

This also seems like it will have the potential to be flakey when run on the CI.

}));
}

fs.readFile(__filename, common.mustCall((e, d) => {

This comment has been minimized.

Copy link
@thefourtheye

thefourtheye Sep 26, 2018

Contributor

Unused variables.

This comment has been minimized.

Copy link
@gireeshpunathil

gireeshpunathil Sep 27, 2018

Author Member

@thefourtheye - removed, thanks.

// We need to refresh the domain string everytime,
// otherwise the TCP stack that cache the previous lookup
// returns result from memory, breaking all our Math.
dns.lookup(`${randomDomain()}.com`, {}, common.mustCall((e, a, f) => {

This comment has been minimized.

Copy link
@thefourtheye

thefourtheye Sep 26, 2018

Contributor

Throw in an assert.ifError maybe?

This comment has been minimized.

Copy link
@gireeshpunathil

gireeshpunathil Sep 27, 2018

Author Member

sorry, didn't follow you - can you please elaborate? all these lookups are destined to fail, we are latching on the time they take for un-optimized full blown lookups (so as to maximize the threadpool worker thread's engagement)

for (let i = 0; i < slowIOmax; i++) {
// We need to refresh the domain string everytime,
// otherwise the TCP stack that cache the previous lookup
// returns result from memory, breaking all our Math.

This comment has been minimized.

Copy link
@thefourtheye

thefourtheye Sep 26, 2018

Contributor

In that case, using a base string and then keep incrementing an offset would guarantee that there will be no repetition in the domain names used, right? Because, the random numbers might repeat (even though odds of that happening is less)

This comment has been minimized.

Copy link
@gireeshpunathil

gireeshpunathil Sep 27, 2018

Author Member

thanks for the suggestion, followed your suggestion.

This comment has been minimized.

Copy link
@gireeshpunathil

gireeshpunathil Sep 27, 2018

Author Member

@thefourtheye - I just spotted a drawback of doing this: if you run the test back-to-back, it fails the second time onwards - for obvious reasons (the domain strings being static, they get resolved faster). I don't know if it affects our CI tests though.

function randomDomain() {
const d = Buffer.alloc(10);
for (let i = 0; i < 10; i++)
d[i] = 92 + (Math.round(Math.random() * 13247)) % 26;

This comment has been minimized.

Copy link
@thefourtheye

thefourtheye Sep 26, 2018

Contributor

Did you mean to use 97 ('a') instead of 92 ('\')?

This comment has been minimized.

Copy link
@gireeshpunathil

gireeshpunathil Sep 27, 2018

Author Member

removed the method altogether. (yes, I meant a , don't know how I kept it at 92. ;) )

@gireeshpunathil gireeshpunathil force-pushed the gireeshpunathil:uv-threadpool-schedule branch from 95fdbbe to cfb651e Sep 27, 2018

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Sep 27, 2018

@Trott - moved this to internet folder. I didn't know the rules for placement; I see many tests that perform dns lookup on public websites in sequential folder, but looks like none of them are time sensitive.

@cjihrig - I acknowledge that the test is sensitive to response times. For that matter I took these precautions:

  • simplified the test by making only two most essential assertions (easy for spotting)
  • ideally the fastI/O should not take even 1/1000th of slowI/O (even those do not hold a dependency relation in any ways), but given the plurality of CI system types, I kept a conservative duration of 1/100th of that of slowI/O. If we relax it any further (either by increasing the dns lookup count or by reducing the expected ratio), I fear it violates the purpose of the test.
  • sprinkled comments abundantly so that if it fails one can easily reason out.

If you think any specific construct / logic is prone to flake, please let me know, I am happy to improve on it.

@gireeshpunathil gireeshpunathil force-pushed the gireeshpunathil:uv-threadpool-schedule branch from b114e0a to e2541ee Sep 27, 2018

// We need to refresh the domain string everytime,
// otherwise the TCP stack that cache the previous lookup
// returns result from memory, breaking all our Math.
dns.lookup(`${randomDomain}${i}.com`, {}, common.mustCall(() => {

This comment has been minimized.

Copy link
@richardlau

richardlau Sep 27, 2018

Member

Does the test still work if we use one of the reserved TLDs (e.g. .test)? https://tools.ietf.org/html/rfc2606

This comment has been minimized.

Copy link
@gireeshpunathil

gireeshpunathil Sep 27, 2018

Author Member
-  dns.lookup(`${randomDomain}${i}.com`, {}, common.mustCall(() => {
+  dns.lookup(`${randomDomain}${i}.test`, {}, common.mustCall(() => {

@richardlau - I tested with this change (if that is what you meant) and it still works. But what would be an advantage of it?

This comment has been minimized.

Copy link
@richardlau

richardlau Sep 27, 2018

Member

The reserved TLDs are specifically reserved to avoid conflicts with actual registered domain names, e.g. nonexistent0.com might exist in the future.

They can also avoid unnecessary load on the DNS servers, see https://tools.ietf.org/html/rfc6761 section 6.

This comment has been minimized.

Copy link
@gireeshpunathil

gireeshpunathil Sep 27, 2018

Author Member

@richardlau : ok - let me clarify: the purpose of the test is to engage (half of) the libuv threadpool workers as long as possible while making sure at least one thread is free and available to serve the file I/Os. So in that context, what we want is a list of domain names that are unique, and were unresolved earlier (resolved names are cached so retrieved faster than we wanted them to), existent or non-existent is not a consideration for the test. If the name is confusing, I can change it to unique0.com etc.

Am I missing anything?

This comment has been minimized.

Copy link
@addaleax

addaleax Sep 30, 2018

Member

If caching is an issue, we might need to generate domain names that are unique per-test-run, e.g. based on Date.now() or Math.random()?

This comment has been minimized.

Copy link
@gireeshpunathil

gireeshpunathil Sep 30, 2018

Author Member

yes (my first commit 1d0fcef) had it based on Math.random(). I guess I will re-instate that, thanks.

@gireeshpunathil

This comment has been minimized.

@Trott

This comment has been minimized.

Copy link
Member

commented Sep 30, 2018

I suppose pinging @nodejs/libuv to review may be appropriate here.

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Sep 30, 2018

thanks @Trott . Also cc @nodejs/testing ?

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Oct 4, 2018

re-instated the random domain function, else the test runs the risk of looking up previously resolved domain names and messing up with the time calculation.

new CI: https://ci.nodejs.org/job/node-test-pull-request/17620/

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Oct 4, 2018

arm failure is unrelated:

03:20:28 not ok 1816 parallel/test-tls-generic-stream
03:20:28   ---
03:20:28   duration_ms: 0.949
03:20:28   severity: fail
03:20:28   exitcode: 1
03:20:28   stack: |-
03:20:28     events.js:167
03:20:28           throw er; // Unhandled 'error' event
03:20:28           ^
03:20:28     
03:20:28     Error: 281473541976064:error:140943FC:SSL routines:ssl3_read_bytes:sslv3 alert bad record mac:../deps/openssl/openssl/ssl/record/rec_layer_s3.c:1407:SSL alert number 20
03:20:28     
03:20:28         at DuplexSocket.ondata (internal/wrap_js_stream.js:64:22)
03:20:28         at DuplexSocket.emit (events.js:182:13)
03:20:28         at addChunk (_stream_readable.js:283:12)
03:20:28         at readableAddChunk (_stream_readable.js:264:11)
03:20:28         at DuplexSocket.Readable.push (_stream_readable.js:219:10)
03:20:28         at DuplexSocket._write (/home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/test/common/duplexpair.js:28:22)
03:20:28         at doWrite (_stream_writable.js:410:12)
03:20:28         at clearBuffer (_stream_writable.js:540:7)
03:20:28         at DuplexSocket.Writable.uncork (_stream_writable.js:314:7)
03:20:28         at JSStreamWrap.doWrite (internal/wrap_js_stream.js:153:17)
03:20:28     Emitted 'error' event at:
03:20:28         at TLSSocket._emitTLSError (_tls_wrap.js:601:10)
03:20:28         at TLSWrap.onerror (_tls_wrap.js:267:11)
03:20:28         at DuplexSocket.ondata (internal/wrap_js_stream.js:64:22)
03:20:28         [... lines matching original stack trace ...]
03:20:28         at clearBuffer (_stream_writable.js:540:7)
03:20:28   ...
'use strict';

// test to validate massive dns lookups do not block filesytem I/O
// (or any fast I/O).Prior to https://github.com/libuv/libuv/pull/1845

This comment has been minimized.

Copy link
@thefourtheye

thefourtheye Oct 4, 2018

Contributor

Nit: Space after the dot.

This comment has been minimized.

Copy link
@gireeshpunathil

gireeshpunathil Oct 4, 2018

Author Member

thanks @thefourtheye - done.

// We need to refresh the domain string everytime,
// otherwise the TCP stack that cache the previous lookup
// returns result from memory, breaking all our Math.
dns.lookup(`${randomDomain()}.com`, {}, common.mustCall(() => {

This comment has been minimized.

Copy link
@thefourtheye

thefourtheye Oct 4, 2018

Contributor

Nit: The callback function can be moved out and used with common.mustCall(..., slowIOmax).

This comment has been minimized.

Copy link
@gireeshpunathil

gireeshpunathil Oct 4, 2018

Author Member

I moved the callback body out of the loop. However, issue with attaching expected call count to the common.mustCall is: the loop executes 100 times, and each time when we invoke common.mustCall with slowIOmax, the total expectation becomes 10K. So I left it as is (default:1). Hope this is fine with you?

This comment has been minimized.

Copy link
@thefourtheye

thefourtheye Oct 4, 2018

Contributor

I was thinking about something like

const onResolve = common.mustCall(..., slowIOmax);
...
dns.lookup(..., onResolve);

But your current change is intuitive. Let's go with that.

@Trott

This comment has been minimized.

Copy link
Member

commented Oct 4, 2018

@Trott

This comment has been minimized.

Copy link
Member

commented Oct 4, 2018

@addaleax
Copy link
Member

left a comment

LGTM, but we might need to be careful about landing this because the libuv patch may be at fault for a regression in our tests, see nodejs/reliability#18 (comment)

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Oct 5, 2018

thanks @addaleax for the info. sure, let me hold this until the libuv refix lands in node.

@gireeshpunathil gireeshpunathil changed the title test: add uv threadpool congestion regression test [do not land] test: add uv threadpool congestion regression test Oct 5, 2018

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Oct 5, 2018

@addaleax - also if you can describe a higher level statement on the flakes that the current libuv caused, I can see if that scenario can be added to this test. Unfortunately I wasn't able to follow root cause of the recent flakes (at the thread level)

@addaleax

This comment has been minimized.

Copy link
Member

commented Oct 5, 2018

@gireeshpunathil So … as far as I can tell, the issue was that slow_io_work_running was increased with one mutex locked, and decreased with another mutex locked; that left it vulnerable to race conditions between workers, and so writes to the variable could get lost. That could make it seem like we have exhausted our # of available threads for slow I/O work, when in reality they were idle; that in turn prevented new DNS requests from being executed, and so the tests linked in libuv/libuv#2021 stalled and eventually timed out.

I think any test that would perform a lot of fast DNS requests (e.g. for localhost) in a row would be hit by that bug, at least on some platforms?

@Trott

This comment has been minimized.

Copy link
Member

commented Nov 21, 2018

@addaleax @gireeshpunathil Can this move forward? Or is it blocked on libuv for now? Or something else?

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Nov 21, 2018

I believe the test makes meaningful assertions only in the presence of libuv/libuv@daf04e8 , that is slated to land in node through libuv v1.24.x

@Trott Trott added the blocked label Nov 21, 2018

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Dec 6, 2018

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Dec 7, 2018

not sure what to make out of the CI result. the graphical view says something failed, but going into the link I can't see any?

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Dec 8, 2018

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Dec 9, 2018

Failures in both Linux and AIX are the same, test-cli-syntax that is supposedly addressed through #24907

Fresh CI: https://ci.nodejs.org/job/node-test-pull-request/19350/

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Dec 10, 2018

after many runs of CI in expectation of a green sign giving up for now!

test-cli-syntax needs to be fixed / bailed out of the suite.

I will attempt to debug that first to see what is happening.

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Dec 28, 2018

all the known failures are either marked as flaky or resolved; so another round of CI - looks like I can't resume from the old build, it is expired:
https://ci.nodejs.org/job/node-test-pull-request/19854/

@gireeshpunathil gireeshpunathil force-pushed the gireeshpunathil:uv-threadpool-schedule branch from 04786b3 to 05c3ae9 Dec 28, 2018

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Dec 28, 2018

a test is timing out in arm. Resume Build CI: https://ci.nodejs.org/job/node-test-commit/24705/

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Dec 29, 2018

test: regression test for uv threadpool congestion
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>

@gireeshpunathil gireeshpunathil force-pushed the gireeshpunathil:uv-threadpool-schedule branch from 05c3ae9 to 54fa59c Dec 30, 2018

@gireeshpunathil gireeshpunathil merged commit 54fa59c into nodejs:master Dec 30, 2018

1 of 2 checks passed

Travis CI - Pull Request Build Errored
Details
Travis CI - Branch Build Passed
Details
targos added a commit that referenced this pull request Jan 1, 2019
test: regression test for uv threadpool congestion
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@Trott

This comment has been minimized.

Copy link
Member

commented Jan 1, 2019

I'll open a separate issue, but this test failed on node-daily-master (the only place where internet tests are run) last night. Considering it was only added two days, that's probably cause for concern?

https://ci.nodejs.org/job/node-test-commit-custom-suites/810/default/console

test-rackspace-ubuntu1604-x64-1

00:02:11 not ok 25 internet/test-uv-threadpool-schedule
00:02:11   ---
00:02:11   duration_ms: 0.613
00:02:11   severity: fail
00:02:11   exitcode: 1
00:02:11   stack: |-
00:02:11     assert.js:351
00:02:11         throw err;
00:02:11         ^
00:02:11     
00:02:11     AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 17, expected: 3.63
00:02:11         at GetAddrInfoReqWrap.onResolve (/home/iojs/build/workspace/node-test-commit-custom-suites/default/test/internet/test-uv-threadpool-schedule.js:41:12)
00:02:11         at GetAddrInfoReqWrap.callback (/home/iojs/build/workspace/node-test-commit-custom-suites/default/test/common/index.js:376:15)
00:02:11         at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:57:17)
00:02:11   ...
@thefourtheye

This comment has been minimized.

Copy link
Contributor

commented Jan 7, 2019

Interesting. Looks like fast IO is really not 100 times faster than slow IO in our infra.

@gireeshpunathil Would it be okay to change the test to ensure that the fast IO takes lesser time (not 1/100th but simply fastIOtime < slowIOtime) than slow IO? Basically it will ensure that DNS lookups do not impact fs operations.

@gireeshpunathil

This comment has been minimized.

Copy link
Member Author

commented Jan 7, 2019

@thefourtheye - yes, and that is @Trott's #25358

refack added a commit to refack/node that referenced this pull request Jan 14, 2019
test: regression test for uv threadpool congestion
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: nodejs#8436

PR-URL: nodejs#23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@BridgeAR BridgeAR referenced this pull request Jan 16, 2019
@MylesBorins MylesBorins referenced this pull request Jan 24, 2019
BethGriggs added a commit that referenced this pull request Apr 17, 2019
test: regression test for uv threadpool congestion
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
BethGriggs added a commit that referenced this pull request Apr 28, 2019
test: regression test for uv threadpool congestion
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
@BethGriggs BethGriggs referenced this pull request May 1, 2019
BethGriggs added a commit that referenced this pull request May 10, 2019
test: regression test for uv threadpool congestion
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
MylesBorins added a commit that referenced this pull request May 16, 2019
test: regression test for uv threadpool congestion
Validate that massive dns lookups do not block filesytem I/O
(or any fast I/O for that matter).
Prior to libuv/libuv#1845 few back-to-back dns
lookup were sufficient to engage libuv threadpool workers in a blocking
manner, throttling other work items that need the pool. this test acts
as a regression test for the same.

Start slow and fast I/Os together, and make sure fast I/O can complete
in at least in 1/100th of time for slow I/O.

Refs: libuv/libuv#1845
Refs: #8436

PR-URL: #23099
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.