-
Notifications
You must be signed in to change notification settings - Fork 29.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert "net: enable autoSelectFamily by default" #48403
Conversation
This reverts commit 8b51c1a. Signed-off-by: Matteo Collina <hello@matteocollina.com>
Review requested:
|
cc @nodejs/releasers @nodejs/tsc This is technically semver-major but it's useful only if it is backported to Node v20. |
We might want to land this only to v20.x. |
For context, there are related discussions in #47644, #48028, and #48145. @silverwind suggested reverting in #48145, and other collaborators have mentioned that they'd have preferred not to enable Happy Eyeballs in the first place. However, according to @ShogunPanda, all known bugs (aside from deviating from the RFC) are fixed in 20.3.0. Do the fastify issues persist with 20.3.0 @mcollina? |
Yes, it persist with 20.3.0. |
About having this enabled by default: I'm a bit conflicted on this. Regardless of the implementation, I wonder if will cause more harm than good to users. To clarify, it is good that we had lot of bugs reports and we were able to fix them. But I start thinking a feature like this has too many network implications that probably we should have it disable by default and always have user explicitly opt-in when they need it. |
My preference would be to revert only for v20 and working out the remaining issues on v21+ (#48145 would be a must-have for me). |
I don't understand the parallel connection request. According to Happy Eyeballs RFC 8305, section 5, connections attempts SHOULD NOT be made in parallel. |
I think you are misinterpreting the RFC. It just says they should not start in parallel (meaning true concurrency, e.g. multiple CPUs), but they can and should be pending at the same time. Essentially the connections are raced against each other. At least that is the whole point of Happy Eyeballs to my understanding and Wikipedia seems to agree. |
I'm not quite sure about it. The RFC complete sentence is:
So it's not referring to just resources on the originating machine. If we implement like that, each new connection will result in at least two connection attempts. This will affect source machine, routers, ISP and so forth. |
I think the RFC authors mean that multiple connection attempts should not be started simultaneously. A good implementation should base the delay between attempts on the SYN retransmission, so at most two SYN packets would be sent at a time in a proper parallel implementation. @silverwind is right. The important sentence from the RFC is this:
|
The discussion around parallel connections should happen in #48145. As far as this PR goes, there are test failures, so presumably either more commits need to be reverted or additional commits to fix the tests. |
@richardlau Sure, we will move the discussion there. @mcollina I checked your fastify failure and I made several tests using main branch. The problems seems not to be in family autoselection (as I manually disabled in the code) but in the presence of |
I confirmed in private with @mcollina. The fastify failures are unrelated to family auto selection and it seems to be a weird combination between the tap ecosystem, Node and NODE_OPTIONS. Do we want to close this PR or do we still want to proceed? |
I don't have a strong opinion. The current implementation is flawed, apparently both due to technical reasons and because of disagreement w.r.t. how the RFC should be interpreted, and most importantly, it does introduce new network problems. On the other hand, it also solves some historic (and somehow still relevant) network problems. Myself and others would have preferred |
According to KararTY/dank-twitch-irc#13 (comment), the
Does this imply that you are in favor of disabling the feature @ShogunPanda? |
@tniessen To be honest no. I checked KararTY/dank-twitch-irc#13 (comment) and I could not reproduce the issue. All the other bug reports you mentioned are lacking the version so it's impossible to see whether they were in 20.3.0 or not. Other than the annoying duplicate bug reports, may I ask you why are you so eager in disabling this? |
Because each "annoying duplicate bug report", as you put it, represents at least one but potentially even a large number of negatively affected users, whereas we have no reliable metrics for the benefits of the feature. (To be fair, I am no expert in this matter so I tend to rely on others' expertise, and I have yet to meet someone more knowledgeable than Ben, for example, who was against enabling this by default.) |
I see. |
In this case I think it might still be okay to wait for maybe a week to confirm that this still reproduces on v20.3.0. But in retrospect I think we should've reverted this a month ago. When we notice any large-scale regression, it's usually never really worth waiting for a non-trivial fix. Reverting and do a patch release ASAP and then wait for a fix to reland is almost always better. The more we wait the more users get frustrated and will have to add workarounds to their code/environment which are meant to go away. |
I couldn't agree more @joyeecheung. We should have reverted much earlier and postponed to Node.js 21, which also won't become LTS, or perhaps indefinitely.
It has been confirmed, see KararTY/dank-twitch-irc#13 (comment). But I guess the intention is to fix yet another bug and hope that that finally resolves the issue. |
@tniessen I already fixed that bug. AFAIK there is only npm/cli#6409 (which I'm working on at the moment) and then I have nothing else reported. Am I missing any bug? |
I'm talking about the bug discussed in KararTY/dank-twitch-irc#13, which was supposed to be fixed in 20.3.0 :) |
Lol, yes. I know. Coincidentally, two bugs failed in the same assertion. I fixed the first one and now I'm taking care of the second one. |
If we still have bugs that are pending a fix, I think the best thing to do is have a release ASAP that turns it off by default again. The more we wait the more users are going to be bothered by regressions. Improving the implementation of a new feature is not a good justification for regressions, IMO. (Do we really need a full revert like what this PR does though? It seems to me we just need to set the CLI option to false by default again) |
@joyeecheung The option was renamed when I turned it on so yes, we need a full revert unfortunately. Anyway, just for context, I'd like to emphasize that the option existed until the full course of 19 and no bug reports came in. I'm afraid that if we turn this off we will never be able to turn it on reliably anymore. |
I don't have a strong opinion either, but It's unlikely to have this PR ready for this week and have it released due to the upcoming security release. Considering the main purpose of this issue was to prevent connectivity issues within IPV6, I feel this is something we'll need to have at some moment. As stated earlier in this thread, if we had to revert, we should have rolled back weeks ago. Reverting it now into a breaking change isn't great IMO. |
How about only disabling in on the v20 branch? I guess v21 can accept a bit more instability as you have seen that v19 barely got any real testing, and that will give plenty of time to fix the issues before v22. |
That might be an acceptable solution, but I'm still pretty skeptical about the approach. I think we might be able to solve all the (major) issues in few weeks. I totally agree that we should have reverted this long ago probably (and I apologize about this), but now the pro and cons of a breaking change lean towards the cons. |
FYI: I fixed all issue I'm aware of in #48464. |
Why do we need a full revert just because it's renamed? We could also keep the name but just turn it off by default again?
Why would you think that would be the case? We could just turn it off and quickly release a patch, then land the fix and reenable it. If bug reports still come in, we do that again. Turning it on and off isn't ideal, but probably better than having a large-scale regression frustrating users for weeks. This happens a lot in Chromium/V8, for example, though they have a different release model (mostly Canary + Stable, but not too different from our Current + LTS I would say). That said now that the security release is in progress, it's probably too late to do this cycle now. But we should probably try something similar for large-scale regressions in the future. |
It's not just the rename of the options. We also have to revert some tests that changed error behavior.
Well, all flipping is technically a breaking change. If we don't care too much then probably is not an issue. |
I think we could just enable the flag in those tests?
The point about not breaking is so that we don't have large-scale regressions. As this feature is very new I doubt anyone is currently relying on it being the default, meanwhile judging from the bug reports, there are definitely more people relying on it not regressing.
I don't see that as a problem when we are already sure about the errors. Again, we could just revert it while we work on known and certain bugs, instead of leaving users frustrated while we work on the bugs that we already know. We can just flip it back when the fix lands to see if the error still reproduces, Chromium/V8 has been doing this for many years and they have a much larger user base than ours. |
And yet, these problems have not been tackled quickly. Node.js 20 was released more than two months ago and #47644 was opened just a day later. Still, every 20.x release since then has contained bugs related to this feature that affect a large number of users. |
I'm not sure where you are going with this. As soon as bugs started popping I promptly tried to fix all of them at my maximum speed. Unfortunately, unless we went for patch release (and, TBH, I don't even know what is the procedure for this), we had to wait for new release to be cut for people to benefit for it. |
Good point. Yes, we can.
We can definitely do that. At the moment I've fixed all the bugs I had a report for. |
I think we all agree that we should have reverted many weeks ago. That's all I wanted to say :) |
It seems things are better now. I'm closing this. We can reopen it if needed. |
This reverts commit 8b51c1a.
I think we were too eager to turn this on. In the Fastify community, we have been trying to get our tests to pass on Node v20 for a month and a half with minimal luck. The only solution was to use
NODE_OPTIONS=no-network-family-autoselection
.Note that tests spuriously fail even with v20.3.0, so the fixes are not complete. I was not able to reproduce the problem individually. My 2 cents is that there is a race condition somewhere triggered by many open sockets in a short period of time.