
test: fix flaky test-net-connect-local-error #12964


@sebastianplesciuc
Contributor

commented May 11, 2017

Fixed test-net-connect-local-error by moving the test from parallel to sequential.
Reverted to commit https://github.com/nodejs/node/blob/eeae3bd07145a770209e4899a9d40f67109d3d01/test/parallel/test-net-connect-local-error.js. Added a few more assertions.

Fixes: #12950

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

test

@mscdex mscdex added the net label May 11, 2017

@refack refack self-assigned this May 11, 2017

@refack
Member

left a comment

One small request; other than that, let's see what the CI says.

test/parallel/test-net-connect-local-error.js Outdated
`${err.localAddress} !== ${common.localhostIPv4} in ${err}`
);
getUnassignedPort(common.mustCall((unassignedPort) => {
assert(unassignedPort);

@refack

refack May 11, 2017

Member

Use assert.ok, or even assert.strictEqual(typeof unassignedPort, 'number')

@sebastianplesciuc

sebastianplesciuc May 12, 2017

Author Contributor

@refack Fixed! Thanks! I've also moved the getUnassignedPort call closer to where the value is actually used.

@refack

Member

commented May 11, 2017

@sebastianplesciuc sebastianplesciuc force-pushed the sebastianplesciuc:fix-flaky-test-net-connect branch May 12, 2017

test/parallel/test-net-connect-local-error.js Outdated
server.listen({port: 0}, common.mustCall(() => {
// When the server is closed this port will no longer be assigned
const unassignedPort = server.address().port;
server.close(common.mustCall(() => {

@santigimeno

santigimeno May 12, 2017

Member

TBH I don't see a completely safe way of calling net.connect() to a free port. I would probably just move the original test (before the port changes) to sequential.
/cc @nodejs/testing @thefourtheye

@refack

refack May 12, 2017

Member

@santigimeno thanks for the feedback
IMHO port + 1 will always be flaky even in sequential.
@nodejs/testing It seems like we need a way to find a deterministically erroring port for a few other tests as well... Re: #12996

@refack

refack May 12, 2017

Member

FWIW, on Windows (and according to RFC 793) the closed socket will enter the TIME_WAIT state for 2*MSL, will not respond with SYN,ACK, and will not be reused by the OS.

@refack

refack May 12, 2017

Member

Funny story: I've been looking at this issue's brother #12951.
These tests are run in parallel: first this one, then the other.
In the other one there's a server that's supposed to receive 6 connections, instead it received 7.
I wonder where that 7th request comes from 🤣

@santigimeno

santigimeno May 13, 2017

Member

IMHO port + 1 will always be flaky even in sequential.

I'm not sure I follow. Can you elaborate?
If you mean that there can be another test using common.PORT + 1 (or common.PORT for that matter), I agree, but I think we'll be fine as long as there's no test in sequential listening on or binding to port 0.

@Trott

Trott May 13, 2017

Member

IMHO port + 1 will always be flaky even in sequential.

I think that's wrong, unless you're just arguing that some other process can always use that port. But I don't think we generally concern ourselves with that. I'm with @santigimeno: Moving it to sequential seems like the simpler and better option.

@thefourtheye

thefourtheye May 14, 2017

Contributor

Thirded @santigimeno's suggestion, moving to sequential.

@refack
refack approved these changes May 12, 2017
@refack refack referenced this pull request May 12, 2017
3 of 3 tasks complete
test/parallel/test-net-connect-local-error.js Outdated
common.localhostIPv4,
`${err.localAddress} !== ${common.localhostIPv4} in ${err}`
);
server.close();

@refack

refack May 12, 2017

Member

I just thought about it, you don't need getUnassignedPort. There is no need for a server to be alive for testing that a connection to an empty port will fail.
Do all the testing in the server.close() callback. Use the closed server's assigned port like you did in getUnassignedPort

@sebastianplesciuc

sebastianplesciuc May 13, 2017

Author Contributor

@refack You're right. I've used 8080 like I did here since no connection is actually made and this issue suggests that in the future there will be a linting rule against using common.PORT in parallel tests.

I've changed it now, thanks!

@sebastianplesciuc sebastianplesciuc force-pushed the sebastianplesciuc:fix-flaky-test-net-connect branch May 13, 2017

test/parallel/test-net-connect-local-error.js Outdated
const port = server.address().port;
server.close(common.mustCall(() => {
const client = net.connect({
port: 8080,

@refack

refack May 13, 2017

Member

I think we should test the other way around as well: {port: port, localPort: 8080} (with a new client)

@refack

refack May 13, 2017

Member

Actually, we also need to assert the other properties of err, like:

  assert.strictEqual(err.syscall, 'connect');
  assert.strictEqual(err.code, 'ECONNREFUSED');
  assert.strictEqual(err.message, `connect ECONNREFUSED ${err.address}:${err.port} - Local (${err.localAddress}:${err.localPort})`);
@refack
refack approved these changes May 13, 2017
@refack

Member

commented May 13, 2017

IMHO we have a good solution for the flakiness of test.
My comments are improvements, and could go into a different PR if you don't have the time.

@sebastianplesciuc

Contributor Author

commented May 13, 2017

@refack I could work on the requested changes in this PR either today or tomorrow. It will need a new CI. Let me know how to proceed. If you need to land this fast, I can make the changes in another PR.

@refack

Member

commented May 13, 2017

It's the weekend, there's no rush. If you have the time and energy, add the assertions and the reversed connection to this PR.

@refack refack referenced this pull request May 13, 2017
3 of 3 tasks complete

@sebastianplesciuc sebastianplesciuc force-pushed the sebastianplesciuc:fix-flaky-test-net-connect branch May 13, 2017

@sebastianplesciuc

Contributor Author

commented May 13, 2017

@refack Made the changes! Thanks :)

@Trott
Member

left a comment

I'm very uncomfortable with all of the PRs lately that add non-trivial amounts of code and complexity for tests where moving to sequential is the better solution. The marginal cost of having the test in sequential is negligible (maybe 150 ms on a few platforms?). Our slowest CI platforms don't benefit at all from having tests in parallel. (They run them sequentially anyway.) I'd much rather have simple, short, straightforward, easy-to-understand, easy-to-maintain tests. The time taken to do the whole reserve-a-port-then-close-the-server dance probably largely negates any benefit from having the test in parallel anyway.

@santigimeno

Member

commented May 13, 2017

@Trott I have to agree with you (even though I was at first supporting those kinds of complex changes)

@refack

Member

commented May 13, 2017

I'm very uncomfortable with all of the PRs lately that add non-trivial amounts of code and complexity for tests where moving to sequential is the better solution. The marginal cost of having the test in sequential is negligible (maybe 150 ms on a few platforms?). Our slowest CI platforms don't benefit at all from having tests in parallel. (They run them sequentially anyway.) I'd much rather have simple, short, straightforward, easy-to-understand, easy-to-maintain tests. The time taken to do the whole reserve-a-port-then-close-the-server dance probably largely negates any benefit from having the test in parallel anyway.

  1. The original test is flaky even sequentially.
  2. I agree about non-trivial code in tests, so we removed the "reserve-a-port-then-close-the-server-dance". Most of what was added by e8eabd2 are extra assertions, and a new test case.
    The only non-trivial code change was moving the logic into the server.on('close') callback.

@Trott PTAL

P.S. If you had written the review as a comment I would have given it a 👍. Found it.

@refack

Member

commented May 13, 2017

New CI: https://ci.nodejs.org/job/node-test-commit/9867/
( I have a hunch it'll fail on Windows :( )

test/parallel/test-net-connect-local-error.js Outdated
@@ -3,25 +3,46 @@ const common = require('../common');
const assert = require('assert');
const net = require('net');

const fixedPort = 8080;

@Trott

Trott May 13, 2017

Member

Why are we hard-coding 8080 and not using common.PORT or common.PORT + 1 or whatever? Just to avoid moving to sequential?

@refack

refack May 13, 2017

Member

This needs to change.

@sebastianplesciuc

sebastianplesciuc May 13, 2017

Author Contributor

@Trott I thought to use it because of your comment on this: #12639

Also as I understood from the common.PORT in parallel tests issue, it was planned to use a linting rule against using common.PORT in parallel tests.

@refack

refack May 13, 2017

Member

@sebastianplesciuc I think they convinced me to move the test to /sequential/; there it's OK.

@Trott

Trott May 13, 2017

Member

I thought to use it because of your comment on this: #12639

@sebastianplesciuc Not sure which comment you mean.

Also as I understood from the common.PORT in parallel tests issue, it was planned to use a linting rule against using common.PORT in parallel tests.

If/when that happens, eslint-disable comments can be used for any remaining valid common.PORT uses in parallel. Changing them now to accommodate a rule that may never come to pass is probably putting the cart before the horse.

Regardless, none of that applies if the test is moved to sequential. :-D

Lastly: I hope none of this is too frustrating for you. I appreciate all the work you're doing and I know it's not fun to get contradictory suggestions from people.

@Trott

Trott May 13, 2017

Member

Oh, hooray, we're all kinda sorta on the same page (or getting there) after all. :-D

@Trott

Trott May 13, 2017

Member

@sebastianplesciuc Oh, I think I see the test/comment you are referring to. In that case, other (intentional and for testing purposes) errors in the code prevent that port from ever being in use. If I understand what's going on in this test (and I may not!), that port (the one that is now 8080) does in fact get used. A connection is attempted there and ECONNREFUSED is expected, meaning nothing is listening on that port. So if something else is using that port, bad things happen. Again, I may be misunderstanding the test, but that's the way it seems to me. (Massively divided attention right now, apologies if I'm hurting more than I'm helping by participating.)

@sebastianplesciuc

sebastianplesciuc May 14, 2017

Author Contributor

@Trott I understood why this should move to sequential. I'm not defending this, I understand why this is the case and I agree with your review. I just wanted to explain why I thought to use 8080 there.

Frankly, I'm not really sure what happens on every platform in a server's close callback. I just thought you guys might know and determine if this is an acceptable solution. I'm satisfied with the outcome and also I've learned some things along the way.

So, thanks for that :)

@Trott

Member

commented May 13, 2017

I think we should revert the changes in this file that were included in 94eed0f and move this file to sequential. That commit introduced port + 1 to replace common.PORT + 1 and that change is a bug.

  • port + 1 could be in use by another test (and is extremely likely to be because operating systems seem to often or always supply these ports in sequential order)

  • port + 1 could be an invalid port number if the operating system supplies port 65535 for port

I'm not sure what the nature is of the flakiness that's being seen, but that seems very likely to resolve it. (The first bullet point is the more important one in this regard. If another test running in parallel uses port 0 somewhere shortly after this test does, it is exceedingly likely to get port + 1 assigned resulting in a collision and flakiness in one or both tests.)

@refack

Member

commented May 13, 2017

port + 1 could be in use by another test (and is extremely likely to be because operating systems seem to often or always supply these ports in sequential order)

Yeah, I found which one: #12951. There, a server receives 7 requests when the test clearly only issues 6 🤣

But this test needs some fixin' since it's flaky even sequentially (with server.listen() and server.close() run synchronously 🤦‍♂️)

@sebastianplesciuc we need to rethink this test: it fails on Windows :(, and there's the hard-coded 8080 port. So anyway, I agree we need to move the test to /sequential/

@sebastianplesciuc

Contributor Author

commented May 14, 2017

@refack I'll take a look at the code before the bind to 0 commit and try to make a PR with the move to sequential if that's ok. Should we close this PR?

@refack

Member

commented May 14, 2017

@refack I'll take a look at the code before the bind to 0 commit and try to make a PR with the move to sequential if that's ok. Should we close this PR?

IMHO we should take the current changes with us to /sequential/; the old test format was just 👎

@Trott

Member

commented May 16, 2017

@Trott do you have any further comments? Our CI is green, and landing this will stop the false negative CIs on macOS & FreeBSD...

@refack LGTM

@Trott
Trott approved these changes May 16, 2017
Member

left a comment

LGTM if CI is green

refack added a commit to refack/node that referenced this pull request May 16, 2017
test: fixed flaky test-net-connect-local-error
Fixed test-net-connect-local-error by moving the test from
parallel to sequential.

PR-URL: nodejs#12964
Fixes: nodejs#12950
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
@refack

Member

commented May 16, 2017

Landed in cf30d5e

@refack refack closed this May 16, 2017

@refack

Member

commented May 16, 2017

refack added a commit to refack/node that referenced this pull request May 16, 2017
test: fixed flaky test-net-connect-local-error
Fixed test-net-connect-local-error by moving the test from
parallel to sequential.

PR-URL: nodejs#12964
Fixes: nodejs#12950
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
@refack

Member

commented May 16, 2017

Relanded in 0c2edd2 (forgot the missing LF)

@refack

Member

commented May 16, 2017

@Trott

Member

commented May 16, 2017

Relanded in 0c2edd2 (forgot the missing LF)

Not a fan of running CI after landing. You could just push your fix to their branch and run CI against the PR with your fixes in place.

@refack

Member

commented May 16, 2017

Not a fan of running CI after landing. You could just push your fix to their branch and run CI against the PR with your fixes in place.

After landing I run against master (and follow up, reverting if needed)

@gibfahn

Member

commented May 16, 2017

Not a fan of running CI after landing. You could just push your fix to their branch and run CI against the PR with your fixes in place.

@Trott To be clear, I think what @refack does is make sure CI ran before landing, then run CI again after landing to make sure no last minute other conflicting PR might have caused an issue. If this is the case, it's an example of extra rigour around the release process, and I'm quite impressed that anyone makes the effort.

@Trott

Member

commented May 16, 2017

@Trott To be clear, I think what @refack does is make sure CI ran before landing, then run CI again after landing to make sure no last minute other conflicting PR might have caused an issue. If this is the case, it's an example of extra rigour around the release process, and I'm quite impressed that anyone makes the effort.

Ah! I see now. Yes, that's awesome. 👍 Thanks for the clarification.

@Trott

Member

commented May 17, 2017

Thinking a bit more on this, I would ask that you (and everyone) please please please at least run make jslint/vcbuild jslint before pushing to master.

Our docs ask that people run make test/vcbuild test before pushing to master, but I know not everyone (especially those who land a lot of pull requests) does that.

JS linting is comparatively fast and would catch most of the "oops, I shouldn't have pushed to master" things that seem to come up from time to time, including this one.

@sebastianplesciuc

Contributor Author

commented May 17, 2017

@Trott I apologize for not doing this, but I didn't expect the PR to land until after I had fixed make test on my machine. As you can see above, I didn't tick the "make -j4 test (UNIX), or vcbuild test (Windows) passes" box, because it didn't pass. I thought you guys might give me some input on how to fix it, and then I would fix it and commit the final version.

@refack

Member

commented May 17, 2017

Thinking a bit more on this, I would ask that you (and everyone) please please please at least run make jslint/vcbuild jslint before pushing to master.

This is on me. It was a known lint failure I said I'd fix before landing #12964 (comment). I broke my own rule and landed this after 10PM 🤦‍♂️
[edit] I wanted to land this ASAP because of all the false negatives on the CI [/edit]

Re: nodejs/build#705 IMHO we should strive to move all the automatable (read: boring, repetitive, and human-error-prone) tasks to the CI.

@refack

Member

commented May 17, 2017

P.S. a git hook that lints only git changed files:

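#!/c/node/node
// Git pre-commit hook: run JSHint only on staged .js files.
var cmd = require('child_process');
// List staged files that were Added, Copied, or Modified, keeping only names ending in .js.
cmd.exec('git diff --cached --name-only --diff-filter=ACM | grep "\\.js$"', function (err, stdout) {
    // No matching staged files: nothing to lint.
    if (stdout.length === 0) return;
    var args = stdout.split('\n');
    // Stand-in for the program name, since the CLI expects argv-style input.
    args.unshift('');
    // Drop the empty string left after the trailing newline.
    args.pop();
    var cli = require("jshint/src/cli.js");
    cli.getBufferSize = function () { return 0; };
    cli.interpret(args);
});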
@gibfahn

Member

commented May 17, 2017

P.S. a git hook that lints only git changed files:

This wouldn't catch things that are already committed, right? make jslint should be pretty fast if you run it regularly (due to the caching), so it's probably worth running it on everything.

@Trott

Member

commented May 17, 2017

@sebastianplesciuc I was addressing people with commit bits on the repo. You didn't do anything wrong. (For that matter, @refack's mistake was minor, lots of folks have done it, and he was eager to fix CI.)

Everything's good. We can always improve though. Automation and git pre-commit hooks are both great things to apply here.

@refack refack referenced this pull request May 17, 2017
3 of 3 tasks complete
anchnk pushed a commit to anchnk/node that referenced this pull request May 19, 2017
Sebastian Plesciuc Olivier Martin
test: fixed flaky test-net-connect-local-error
Fixed test-net-connect-local-error by moving the test from
parallel to sequential.

PR-URL: nodejs#12964
Fixes: nodejs#12950
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
@jasnell jasnell referenced this pull request May 28, 2017
@gibfahn gibfahn referenced this pull request Jun 15, 2017
2 of 3 tasks complete
MylesBorins added a commit that referenced this pull request Jun 22, 2017
test: fixed flaky test-net-connect-local-error
Fixed test-net-connect-local-error by moving the test from
parallel to sequential.

PR-URL: #12964
Fixes: #12950
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
MylesBorins added a commit that referenced this pull request Jul 11, 2017
test: fixed flaky test-net-connect-local-error
Fixed test-net-connect-local-error by moving the test from
parallel to sequential.

PR-URL: #12964
Fixes: #12950
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
@MylesBorins MylesBorins referenced this pull request Jul 18, 2017

@refack refack removed their assignment Oct 20, 2018
