Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix flaky test-net-connect-local-error #12964

Closed
wants to merge 1 commit into from
Closed

test: fix flaky test-net-connect-local-error #12964

wants to merge 1 commit into from

Conversation

sebastianplesciuc
Copy link

@sebastianplesciuc sebastianplesciuc commented May 11, 2017

Fixed test-net-connect-local-error by moving the test from parallel to sequential.
Reverted to commit https://github.com/nodejs/node/blob/eeae3bd07145a770209e4899a9d40f67109d3d01/test/parallel/test-net-connect-local-error.js. Added a few more assertions.

Fixes: #12950

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

test

@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label May 11, 2017
@mscdex mscdex added the net Issues and PRs related to the net subsystem. label May 11, 2017
@refack refack self-assigned this May 11, 2017
Copy link
Contributor

@refack refack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small request, other than that, let's see what the CI sais

`${err.localAddress} !== ${common.localhostIPv4} in ${err}`
);
getUnassignedPort(common.mustCall((unassignedPort) => {
assert(unassignedPort);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert.ok, or even assert.strictEqual(typeof unassignedPort, 'number')

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@refack Fixed! Thanks! I've also moved the getUnassignedPort call closer to where the value is actually used.

@refack
Copy link
Contributor

refack commented May 11, 2017

server.listen({port: 0}, common.mustCall(() => {
// When the server is closed this port will no longer be assigned
const unassignedPort = server.address().port;
server.close(common.mustCall(() => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I don't see a completely safe way of calling net.connect() to a free port. I would probably just move the original test (before the port changes) to sequential.
/cc @nodejs/testing @thefourtheye

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@santigimeno thanks for the feedback
IMHO port + 1 will always be flaky even in sequential.
@nodejs/testing It seems like we need a way to find deterministically erring port for a few others tests as well... Re: #12996

Copy link
Contributor

@refack refack May 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW For Windows (and acording to rfc793) the closed socket will enter TIME_WAIT state for 2*MSL and will not SYN,ACK and will not be reused by the OS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny story: I've been looking at this issue's brother #12951.
These tests are run in parallel first this one, then the other.
In the other one there's a server that's supposed to receive 6 connections, instead it received 7.
I wonder where that 7th request comes from 🤣

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO port + 1 will always be flaky even in sequential.

I'm not sure I follow. Can you elaborate?
If you mean that there can be another test using common.PORT + 1 (or common.PORT for that matter), I agree, but I think we'll be fine as long as there's no test on sequential listening/binding to 0 port.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO port + 1 will always be flaky even in sequential.

I think that's wrong, unless you're just arguing that some other process can always use that port. But I don't think we generally concern ourselves with that. I'm with @santigimeno: Moving it to sequential seems like the simpler and better option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thirded @santigimeno's suggestion, moving to sequential.

@refack
Copy link
Contributor

refack commented May 12, 2017

common.localhostIPv4,
`${err.localAddress} !== ${common.localhostIPv4} in ${err}`
);
server.close();
Copy link
Contributor

@refack refack May 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just thought about it, you don't need getUnassignedPort. There is no need for a server to be alive for testing that a connection to an empty port will fail.
Do all the testing in the server.close() callback. Use the closed server's assigned port like you did in getUnassignedPort

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@refack You're right. I've used 8080 like I did here since no connection is actually made and this issue suggests that in the future there will be a linting rule against using common.PORT in parallel tests.

I've changed it now, thanks!

const port = server.address().port;
server.close(common.mustCall(() => {
const client = net.connect({
port: 8080,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should test other way around as well {port: port, localPort: 8080} (with a new client)

Copy link
Contributor

@refack refack May 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually also need to assert the other properties of err like

  assert.strictEqual(err.syscall, 'connect');
  assert.strictEqual(err.code, 'ECONNREFUSED');
  assert.strictEqual(err.message, `connect ECONNREFUSED ${err.address}:${err.port} - Local (${err.localAddress}:${err.localPort)`);

@refack
Copy link
Contributor

refack commented May 13, 2017

IMHO we have a good solution for the flakiness of test.
My comment are improvements, and could go into a different PR if you don't have the time.

@refack
Copy link
Contributor

refack commented May 13, 2017

@sebastianplesciuc
Copy link
Author

@refack I could work on the requested changes in this PR either today or tomorrow. It will need a new CI. Let me know how to proceed. If you need to land this fast, I can make the changes in another PR.

@refack
Copy link
Contributor

refack commented May 13, 2017

It's the weekend, there's no rush, it you have the time and energy add the assertions and reversed connection to this PR.

@sebastianplesciuc
Copy link
Author

@refack Made the changes! Thanks :)

Copy link
Member

@Trott Trott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very uncomfortable with all of the PRs lately that add non-trivial amounts of code and complexity for tests where moving to sequential is the better solution. The marginal cost of having the test in sequential is negligible (maybe 150 ms on a few platforms?). Our slowest CI platforms don't benefit at all from having tests in parallel. (They run them sequentially anyway.) I'd much rather have simple, short, straightforward, easy-to-understand, easy-to-maintain tests. The time taken to do the whole reserve-a-port-then-close-the-server dance probably largely negates any benefit from having the test in parallel anyway.

@santigimeno
Copy link
Member

@Trott I have to agree with you (even though I was at first supporting those kind of complex changes)

@refack
Copy link
Contributor

refack commented May 13, 2017

I'm very uncomfortable with all of the PRs lately that add non-trivial amounts of code and complexity for tests where moving to sequential is the better solution. The marginal cost of having the test in sequential is negligible (maybe 150 ms on a few platforms?). Our slowest CI platforms don't benefit at all from having tests in parallel. (They run them sequentially anyway.) I'd much rather have simple, short, straightforward, easy-to-understand, easy-to-maintain tests. The time taken to do the whole reserve-a-port-then-close-the-server dance probably largely negates any benefit from having the test in parallel anyway.

  1. The original test is flaky even sequentially.
  2. I agree about non-trivial code in tests, so we removed the "reserve-a-port-then-close-the-server-dance". Most of what was added by e8eabd2 are extra assertions, and a new test case.
    The only non trivial code change was moving the logic into the server.on('close') callback.

@Trott PTAL

P.S. If you'd have written the review as a comment I would given it 👍 found it.

@refack
Copy link
Contributor

refack commented May 13, 2017

New CI: https://ci.nodejs.org/job/node-test-commit/9867/
( I have a hunch it'll fail on Windows :( )

@@ -3,25 +3,46 @@ const common = require('../common');
const assert = require('assert');
const net = require('net');

const fixedPort = 8080;
Copy link
Member

@Trott Trott May 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we hard-coding 8080 and not using common.PORT or common.PORT + 1 or whatever? Just to avoid moving to sequential?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Trott I thought to use it because of your comment on this: #12639

Also as I understood from the common.PORT in parallel tests issue, it was planned to use a linting rule against using common.PORT in parallel tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sebastianplesciuc I think they convinced me to move the test to /sequential/ there it's Ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought to use it because of your comment on this: #12639

@sebastianplesciuc Not sure which comment you mean.

Also as I understood from the common.PORT in parallel tests issue, it was planned to use a linting rule against using common.PORT in parallel tests.

If/when that happens, eslint-disable comments can be used for any remaining valid common.PORT uses in parallel. Changing them now to accommodate a rule that may never come to pass is probably putting the cart before the horse.

Regardless, none of that applies if the test is moved to sequential. :-D

Lastly: I hope none of this is too frustrating for you. I appreciate all the work you're doing and I know it's not fun to get contradictory suggestions from people.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, hooray, we're all kinda sorta on the same page (or getting there) after all. :-D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sebastianplesciuc Oh, I think I see the test/comment you are referring to. In that case, other (intentional and for testing purposes) errors in the code prevent that port from ever being in use. If I understand what's going on in this test (and I may not!), that port (the one that is now 8080) does in fact get used. A connection is attempted there and ECONNREFUSED is expected, meaning nothing is listening on that port. So if something else is using that port, bad things happen. Again, I may be misunderstanding the test, but that's the way it seems to me. (Massively divided attention right now, apologies if I'm hurting more than I'm helping by participating.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Trott I understood why this should move to sequential. I'm not defending this, I understand why this is the case and I agree with your review. I just wanted to explain why I thought to use 8080 there.

Frankly, I'm not really sure what happens on every platform in a server's close callback. I just thought you guys might know and determine if this is an acceptable solution. I'm satisfied with the outcome and also I've learned some things along the way.

So, thanks for that :)

@Trott
Copy link
Member

Trott commented May 13, 2017

I think we should revert the changes in this file that were included in 94eed0f and move this file to sequential. That commit introduced port + 1 to replace common.PORT + 1 and that change is a bug.

  • port + 1 could be in use by another test (and is extremely likely to be because operating systems seem to often or always supply these ports in sequential order)

  • port + 1 could be an invalid port number if the operating system supplies port 65535 for port

I'm not sure what the nature is of the flakiness that's being seen, but that seems very likely to resolve it. (The first bullet point is the more important one in this regard. If another test running in parallel uses port 0 somewhere shortly after this test does, it is exceedingly likely to get port + 1 assigned resulting in a collision and flakiness in one or both tests.)

@refack
Copy link
Contributor

refack commented May 13, 2017

port + 1 could be in use by another test (and is extremely likely to be because operating systems seem to often or always supply these ports in sequential order)

Yeah I found which one #12951 there a server receives 7 requests when the test clearly only issues 6 🤣

But this test needs some fixin' since it's flaky even sequentially (with the server.listen(), and server.close() run synchronously 🤦‍♂️ )

@sebastianplesciuc we need to rethink this test, it fails on windows :(, and there's the hard coded 8080 port. So anyway I agree we need to move the test to /sequential/

@sebastianplesciuc
Copy link
Author

@refack I'll take a look at the code before the bind to 0 commit and try to make a PR with the move to sequential if that's ok. Should we close this PR?

@refack
Copy link
Contributor

refack commented May 14, 2017

@refack I'll take a look at the code before the bind to 0 commit and try to make a PR with the move to sequential if that's ok. Should we close this PR?

IMHO we should take current changes with us to /sequential/, old test format was just 👎

@refack
Copy link
Contributor

refack commented May 14, 2017

I'm not sure what the nature is of the flakiness that's being seen

@Trott this fragment

+const server = net.createServer();
+server.listen(0);
+const port = server.address().port;
const client = net.connect({		  const client = net.connect({
-  port: common.PORT + 1,		 +  port: port + 1,

port can be undefined in high load. reverting 94eed0f will solve that.

@sebastianplesciuc scratch my last comment, if you revert 94eed0f and move to /sequential/ that should give a nice test. (still can be this PR, discussion was good).

P.S. revert 94eed0f just on test/parallel/test-net-connect-local-error.js, yes?

@Trott
Copy link
Member

Trott commented May 14, 2017

port can be undefined in high load. reverting 94eed0f will solve that.

Moving the test to sequential also solves that. Add it to the list of reasons to avoid common.PORT in parallel tests.

@Trott
Copy link
Member

Trott commented May 14, 2017

P.S. revert 94eed0f just on test/parallel/test-net-connect-local-error.js, yes?

Agreed, not the whole commit, just the change to this file from that commit.

@Trott
Copy link
Member

Trott commented May 16, 2017

@Trott do you have any further comments? Our CI is green, and landing this will stop the false negative CIs on macOS & freeBSD...

@refack LGTM

Copy link
Member

@Trott Trott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if CI is green

refack pushed a commit to refack/node that referenced this pull request May 16, 2017
Fixed test-net-connect-local-error by moving the test from
parallel to sequential.

PR-URL: nodejs#12964
Fixes: nodejs#12950
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
@refack
Copy link
Contributor

refack commented May 16, 2017

Landed in cf30d5e

@refack refack closed this May 16, 2017
@refack
Copy link
Contributor

refack commented May 16, 2017

refack pushed a commit to refack/node that referenced this pull request May 16, 2017
Fixed test-net-connect-local-error by moving the test from
parallel to sequential.

PR-URL: nodejs#12964
Fixes: nodejs#12950
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
@refack
Copy link
Contributor

refack commented May 16, 2017

Relanded in 0c2edd2 (forgot the missing LF)

@refack
Copy link
Contributor

refack commented May 16, 2017

Post reland CI: https://ci.nodejs.org/job/node-test-commit/9913/

@Trott
Copy link
Member

Trott commented May 16, 2017

Relanded in 0c2edd2 (forgot the missing LF)

Not a fan of running CI after landing. You could just push your fix to their branch and run CI against the PR with your fixes in place.

@refack
Copy link
Contributor

refack commented May 16, 2017

Not a fan of running CI after landing. You could just push your fix to their branch and run CI against the PR with your fixes in place.

After landing I run against master (and follow up, reverting if needed)

@gibfahn
Copy link
Member

gibfahn commented May 16, 2017

Not a fan of running CI after landing. You could just push your fix to their branch and run CI against the PR with your fixes in place.

@Trott To be clear, I think what @refack does is make sure CI ran before landing, then run CI again after landing to make sure no last minute other conflicting PR might have caused an issue. If this is the case, it's an example of extra rigour around the release process, and I'm quite impressed that anyone makes the effort.

@Trott
Copy link
Member

Trott commented May 16, 2017

@Trott To be clear, I think what @refack does is make sure CI ran before landing, then run CI again after landing to make sure no last minute other conflicting PR might have caused an issue. If this is the case, it's an example of extra rigour around the release process, and I'm quite impressed that anyone makes the effort.

Ah! I see now. Yes, that's awesome. 👍 Thanks for the clarification.

@Trott
Copy link
Member

Trott commented May 17, 2017

Thinking a bit more on this, I would ask that you (and everyone) please please please at least run make jslint/vcbuild jslint before pushing to master.

Our docs ask that people run make test/vcbuild test before doing pushing to master, but I know not everyone (especially those who land a lot of pull requests) does that.

JS linting is comparatively fast and would catch most of the "oops, I shouldn't have pushed to master" things that seem to come up from time to time, including this one.

@sebastianplesciuc
Copy link
Author

@Trott I apologize for not doing this. But I didn't expect the PR to land until after I've fixed the make test on my machine. As you can see above, I didn't tick the make -j4 test (UNIX), or vcbuild test (Windows) passes. Because it didn't pass, which made me think that you guys might give me some input on how to fix it, fixing it and commit the final version.

@refack
Copy link
Contributor

refack commented May 17, 2017

Thinking a bit more on this, I would ask that you (and everyone) please please please at least run make jslint/vcbuild jslint before pushing to master.

This is on me. It was a known lint failure I said I'd fix before landing #12964 (comment). I broke my own rule and landed this after 10PM 🤦‍♂️
[edit] I wanted to land this ASAP because of all the false negatives on the CI [/edit]

Re: nodejs/build#705 IMHO we should strive to move all the automatable (read; boring, repetitive, and human-error prone) to the CI.

@refack
Copy link
Contributor

refack commented May 17, 2017

P.S. a git hook that lints only git changed files:

#!/c/node/node
var cmd = require('child_process');
cmd.exec('git diff --cached --name-only --diff-filter=ACM | grep ".js$"', function (err, stdout) {
    if (stdout.length == 0) return;
    var args = stdout.split('\n');
    args.unshift('');
    args.pop();
    var cli = require("jshint/src/cli.js");
    cli.getBufferSize = function () { return 0; };
    cli.interpret(args);
});

@gibfahn
Copy link
Member

gibfahn commented May 17, 2017

P.S. a git hook that lints only git changed files:

This wouldn't catch things that are already committed right? make jslint should be pretty fast if you run it regularly (due to the caching) so it's probably worth running it on everything.

@Trott
Copy link
Member

Trott commented May 17, 2017

@sebastianplesciuc I was addressing people with commit bits on the repo. You didn't do anything wrong. (For that matter, @refack's mistake was minor,lots of folks have done it, and he was eager to fix CI.)

Everything's good. We can always improve though. Automation and git pre-commit hooks are both great things to apply here.

anchnk pushed a commit to anchnk/node that referenced this pull request May 19, 2017
Fixed test-net-connect-local-error by moving the test from
parallel to sequential.

PR-URL: nodejs#12964
Fixes: nodejs#12950
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
@jasnell jasnell mentioned this pull request May 28, 2017
@gibfahn gibfahn mentioned this pull request Jun 15, 2017
3 tasks
MylesBorins pushed a commit that referenced this pull request Jun 22, 2017
Fixed test-net-connect-local-error by moving the test from
parallel to sequential.

PR-URL: #12964
Fixes: #12950
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
MylesBorins pushed a commit that referenced this pull request Jul 11, 2017
Fixed test-net-connect-local-error by moving the test from
parallel to sequential.

PR-URL: #12964
Fixes: #12950
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
@MylesBorins MylesBorins mentioned this pull request Jul 18, 2017
@refack refack removed their assignment Oct 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
net Issues and PRs related to the net subsystem. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants