Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: TestDialCancel is not compatible with new macOS ARM64 builders #52579

Open
heschi opened this issue Apr 26, 2022 · 11 comments
Open

net: TestDialCancel is not compatible with new macOS ARM64 builders #52579

heschi opened this issue Apr 26, 2022 · 11 comments
Labels
Builders NeedsInvestigation OS-Darwin Testing
Milestone

Comments

@heschi
Copy link
Contributor

@heschi heschi commented Apr 26, 2022

TestDialCancel requires that traffic to 198.18.0.254 and 2001:2::254 be blackholed. The comment on slowDialTCP notes that "In some environments, the slow IPs may be explicitly unreachable". The network we're using for the new macOS ARM64 builders is one such environment. I skipped the test in https://go.dev/cl/402181, but it might be nice for someone more knowledgeable about this than me to look and decide what should be done.

@heschi heschi added OS-Darwin Builders NeedsInvestigation labels Apr 26, 2022
@heschi heschi added this to the Backlog milestone Apr 26, 2022
@bcmills
Copy link
Member

@bcmills bcmills commented Apr 26, 2022

The comment on slowDialTCP notes that "In some environments, the slow IPs may be explicitly unreachable".

Hmm, I wonder if TestDialCancel should be installing slowDialTCP! It looks like the other tests that refer to slowDst4 and slowDst6 both do. (Those other tests are TestDialerFallbackDelay and TestDialParallel.)

@ianlancetaylor, @neild, @bradfitz: do you think it would make sense to install slowDialTCP for TestDialCancel?

@bcmills bcmills removed this from the Backlog milestone Apr 26, 2022
@bcmills bcmills added this to the Go1.19 milestone Apr 26, 2022
@neild
Copy link
Contributor

@neild neild commented Apr 26, 2022

Looks to me like TestDialCancel (and any other test using the blackholed addresses) should install slowDialTCP. I'll send you a CL.

@gopherbot
Copy link

@gopherbot gopherbot commented Apr 26, 2022

Change https://go.dev/cl/402454 mentions this issue: net: install slowDialTCP hook in TestDialCancel

@dmitshur dmitshur added NeedsFix and removed NeedsInvestigation labels Apr 27, 2022
@bcmills
Copy link
Member

@bcmills bcmills commented May 4, 2022

@gopherbot, please backport to Go 1.17 and 1.18. This test is currently failing on the release branches too.

@gopherbot
Copy link

@gopherbot gopherbot commented May 4, 2022

Backport issue(s) opened: #52705 (for 1.17), #52706 (for 1.18).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@dmitshur dmitshur added NeedsInvestigation Testing and removed NeedsFix labels May 4, 2022
@dmitshur
Copy link
Contributor

@dmitshur dmitshur commented May 17, 2022

My understanding of the state of this issue is that it can in Backlog milestone without a release-blocker label. That is, the known-failing test is skipped (via CL 402181) so it's not causing noise on the dashboard (#11811), and to me this test seem safe not to run on darwin/arm64 given it's run elsewhere (i.e., the chance of regression that would affect only this one particular port is low). @bcmills Should we remove release-blocker label, or can you clarify why it should be kept? Thanks.

@bcmills
Copy link
Member

@bcmills bcmills commented May 17, 2022

I agree that, given that the failure mode almost certainly indicates a mismatch in the test's assumptions rather than a bug in the net package per se, it would be ok to appropriately widen the test-skips and then move it to the Backlog milestone.

However, the current skip is based on the builder name, so this test may still fail if someone who uses a similar network configuration imports the net package and then runs go test all in their own module, which I argue ought to be a reasonable thing to do — especially given that darwin/arm64 is a first class port.

I suggest:

  • If we believe that the conditions that cause this failure are specific to the darwin/arm64 platform, then we could widen the skip to GOOS = "darwin" && GOARCH == "arm64" (independent of GO_BUILDER_NAME).
  • If we don't really know what the conditions are, but the failure mode is specific enough to still get a useful signal from running the test, then we could skip the test any time we see that failure mode regardless of platform.
  • If we don't know what conditions cause the failure and either we can't easily detect that failure mode (e.g. it manifests as a timeout) or the test isn't useful otherwise (e.g. that failure mode is the only thing the test is looking for to begin with), then we should probably skip the test on all platforms.

@heschi
Copy link
Contributor Author

@heschi heschi commented May 20, 2022

I don't see any immediate reason to think this is a platform-specific problem. It might be, but it seems much more likely that it's just a quirk of the network we're on. (I still hope to move them to a different network but it is taking a while.)

I don't think skipping the tests is costing us much coverage, per Dmitri's comment, and I don't think there's anything release-blocking left now that we've backported to the skip to the release branches. I would like the net maintainers to make any further decisions on what to do here. (Perhaps there's a better way to write the test?)

cc @ianlancetaylor @neild

@neild
Copy link
Contributor

@neild neild commented May 20, 2022

Is there a builder I can run the test on that will reliably fail?

@heschi
Copy link
Contributor Author

@heschi heschi commented May 20, 2022

@bcmills
Copy link
Member

@bcmills bcmills commented May 23, 2022

@heschi

I don't think there's anything release-blocking left now that we've backported to the skip to the release branches.

I disagree: I think the remaining release-blocking action is for the net maintainers to determine whether (and how) to broaden the skip or fix and unskip the test (as I infer Damien is in the process of doing).

That doesn't necessarily need to be a time-intensive process (a CL to skip a test only takes a couple of minutes to mail the change and set Auto-Submit+1), but it does need to happen before the release in order to tie up the loose ends of supporting darwin/arm64 as a first-class port.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Builders NeedsInvestigation OS-Darwin Testing
Projects
None yet
Development

No branches or pull requests

5 participants