x/build: "unexpected stale targets" on darwin-arm64-11_0-toothrot #49692

Closed
bcmills opened this issue Nov 19, 2021 · 11 comments
Labels
Builders, FrozenDueToAge, NeedsFix, OS-Darwin, release-blocker

Comments

@bcmills
Member

bcmills commented Nov 19, 2021

The environment fixes in CL 353549 seem to have cleared up the unexpected-staleness issues on most of the darwin builders. However, darwin-arm64-11_0-toothrot in particular is still showing unexpected staleness.

These failures are showing up much later in the build than in #33598, and not on the same packages, so I think the failure mode is different enough to have a separate underlying cause.

greplogs --dashboard -md -l -e '(?ms)\Adarwin-.*go tool dist: unexpected stale targets' --since=2021-10-09

2021-11-18T02:16:39-353cb71/darwin-arm64-11_0-toothrot
2021-11-15T21:22:18-c8d7c5f/darwin-arm64-11_0-toothrot
2021-11-13T02:30:25-c546052/darwin-arm64-11_0-toothrot
2021-11-12T22:20:50-9150c16/darwin-arm64-11_0-toothrot
2021-11-10T21:32:50-f410786/darwin-arm64-11_0-toothrot
2021-11-05T00:52:06-3839b60/darwin-arm64-11_0-toothrot
2021-10-25T15:43:33-2c66cab/darwin-arm64-11_0-toothrot
2021-10-12T14:32:53-36a265a/darwin-arm64-11_0-toothrot
2021-10-12T12:24:09-4679670/darwin-arm64-11_0-toothrot
2021-10-12T11:00:47-9c1dbdf/darwin-arm64-11_0-toothrot
2021-10-12T06:55:50-d887d3b/darwin-arm64-11_0-toothrot
2021-10-12T04:35:19-6372e7e/darwin-arm64-11_0-toothrot
2021-10-11T22:34:49-d90f0b9/darwin-arm64-11_0-toothrot
2021-10-11T22:32:23-c1b0ae4/darwin-arm64-11_0-toothrot
2021-10-11T22:17:47-b41030e/darwin-arm64-11_0-toothrot
2021-10-11T22:17:41-662c5ee/darwin-arm64-11_0-toothrot
2021-10-11T22:16:44-2ecdf9d/darwin-arm64-11_0-toothrot
2021-10-11T21:58:33-d973bb1/darwin-arm64-11_0-toothrot
2021-10-11T20:46:14-7023535/darwin-arm64-11_0-toothrot
2021-10-11T19:20:12-65ffee6/darwin-arm64-11_0-toothrot
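The greplogs queries in this thread take a Go (RE2) regular expression. To illustrate what the flags do: `(?s)` lets `.` match newlines, so `.*` can span the many lines between the start of a log and the `go tool dist` error; `(?m)` only changes `^` and `$`; and `\A` always anchors at the very beginning of the text. The sketch below assumes, hypothetically, that the searched text begins with the builder name; the sample logs are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// failureRE is the pattern from the greplogs query above.
// (?m): ^ and $ match at line boundaries; (?s): . also matches '\n'.
// \A anchors at the beginning of the whole text, even in (?m) mode.
var failureRE = regexp.MustCompile(`(?ms)\Adarwin-.*go tool dist: unexpected stale targets`)

func main() {
	// Hypothetical log texts, assumed to start with the builder name.
	darwinLog := "darwin-arm64-11_0-toothrot\n...\ngo tool dist: unexpected stale targets\n"
	otherLog := "linux-amd64\ndarwin-arm64-11_0-toothrot\n...\ngo tool dist: unexpected stale targets\n"

	fmt.Println(failureRE.MatchString(darwinLog)) // true: .* crosses the newlines
	fmt.Println(failureRE.MatchString(otherLog))  // false: \A requires "darwin-" at the very start
}
```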

(CC @golang/release)

@bcmills added the Builders, NeedsInvestigation, and OS-Darwin labels Nov 19, 2021
@bcmills added this to the Go1.18 milestone Nov 19, 2021
@bcmills
Member Author

bcmills commented Nov 19, 2021

Going back a bit further in the history:

greplogs --dashboard -md -l -e '(?ms)\Adarwin-arm64.*go tool dist: unexpected stale targets'

2021-11-18T02:16:39-353cb71/darwin-arm64-11_0-toothrot
2021-11-15T21:22:18-c8d7c5f/darwin-arm64-11_0-toothrot
2021-11-13T02:30:25-c546052/darwin-arm64-11_0-toothrot
2021-11-12T22:20:50-9150c16/darwin-arm64-11_0-toothrot
2021-11-10T21:32:50-f410786/darwin-arm64-11_0-toothrot
2021-11-05T00:52:06-3839b60/darwin-arm64-11_0-toothrot
2021-10-25T15:43:33-2c66cab/darwin-arm64-11_0-toothrot
2021-10-12T14:32:53-36a265a/darwin-arm64-11_0-toothrot
2021-10-12T12:24:09-4679670/darwin-arm64-11_0-toothrot
2021-10-12T11:00:47-9c1dbdf/darwin-arm64-11_0-toothrot
2021-10-12T06:55:50-d887d3b/darwin-arm64-11_0-toothrot
2021-10-12T04:35:19-6372e7e/darwin-arm64-11_0-toothrot
2021-10-11T22:34:49-d90f0b9/darwin-arm64-11_0-toothrot
2021-10-11T22:32:23-c1b0ae4/darwin-arm64-11_0-toothrot
2021-10-11T22:17:47-b41030e/darwin-arm64-11_0-toothrot
2021-10-11T22:17:41-662c5ee/darwin-arm64-11_0-toothrot
2021-10-11T22:16:44-2ecdf9d/darwin-arm64-11_0-toothrot
2021-10-11T21:58:33-d973bb1/darwin-arm64-11_0-toothrot
2021-10-11T20:46:14-7023535/darwin-arm64-11_0-toothrot
2021-10-11T19:20:12-65ffee6/darwin-arm64-11_0-toothrot
2021-10-08T17:32:25-a7d3a0e/darwin-arm64-11_0-toothrot
2021-09-23T17:04:30-13f3c57/darwin-arm64-11_0-toothrot
2021-09-23T16:54:46-5961134/darwin-arm64-11_0-toothrot
2021-09-23T16:08:00-335e72b/darwin-arm64-11_0-toothrot
2021-09-23T15:51:39-24c2ee7/darwin-arm64-11_0-toothrot
2021-09-23T15:10:56-abbfec2/darwin-arm64-11_0-toothrot
2021-08-23T19:27:46-3081f81/darwin-arm64-11_0-toothrot
2021-04-21T09:07:09-5f1df26/darwin-arm64-11_0-toothrot
2021-04-16T18:29:56-492eb05/darwin-arm64-11_0-toothrot
2021-04-14T19:32:31-bcbde83/darwin-arm64-11_0-toothrot
2021-03-15T14:36:23-7bfe32f/darwin-arm64-11_0-toothrot
2021-03-10T15:24:31-818f6b1/darwin-arm64-11_0-toothrot
2021-03-04T16:33:06-9a40dee/darwin-arm64-11_0-toothrot
2021-02-19T08:48:55-7764ee5/darwin-arm64-11_0-toothrot
2021-02-18T21:09:46-eb98272/darwin-arm64-11_0-toothrot
2021-02-17T00:04:02-70c37ee/darwin-arm64-11_0-toothrot
2021-01-28T16:35:06-41bb49b/darwin-arm64-11_0-toothrot
2021-01-15T21:46:25-9f83418/darwin-arm64-11_0-toothrot
2021-01-11T19:45:02-7593090/darwin-arm64-11_0-toothrot
2021-01-01T20:05:20-3dd5867/darwin-arm64-11_0-toothrot
2019-11-13T18:03:37-e762378/darwin-arm64-corellium

@bcmills added the okay-after-beta1 and release-blocker labels Nov 19, 2021
@bcmills
Member Author

bcmills commented Nov 19, 2021

Probably the first step is to wait and see whether this still occurs on macOS 12 (#49149).

(If it does not, perhaps we can chalk it up to a kernel bug of some sort and move on.)

@toothrot self-assigned this Dec 8, 2021
@bcmills
Member Author

bcmills commented Dec 8, 2021

Still only seen on the darwin-arm64-11_0-toothrot builder — the macOS 12 builder appears not to be affected.
But if we're going to maintain nominal support for macOS 11, I suppose we need to keep the builder around, which means we can't just chalk it up to a kernel bug. 😞

greplogs --dashboard -md -l -e '(?ms)\Adarwin-arm64.*go tool dist: unexpected stale targets' --since=2021-11-19

2021-12-08T04:14:00-a19e72c/darwin-arm64-11_0-toothrot
2021-12-08T01:23:09-016e6eb/darwin-arm64-11_0-toothrot
2021-12-06T22:35:32-6180c4f/darwin-arm64-11_0-toothrot
2021-12-01T15:43:08-ab79055/darwin-arm64-11_0-toothrot
2021-11-29T22:02:45-1970e3e/darwin-arm64-11_0-toothrot
2021-11-29T19:21:29-4325c37/darwin-arm64-11_0-toothrot
2021-11-27T23:29:50-0fa53e4/darwin-arm64-11_0-toothrot
2021-11-22T21:52:20-100d7ea/darwin-arm64-11_0-toothrot
2021-11-19T21:59:14-5e774b0/darwin-arm64-11_0-toothrot

@bcmills
Member Author

bcmills commented Dec 8, 2021

The last time we saw unexpected stale targets on a darwin builder, it was because of an inconsistent process environment: we were failing to set PWD in some cases, and as a result we were getting different symlink resolution for /tmp (which is a symlink to /private/tmp on recent macOS).

So a good debugging step here might be to modify cmd/dist.checkNotStale to dump the process environment, which we can then check for differences against the environment logged at the beginning of the test.

@cherrymui removed the okay-after-beta1 label Dec 14, 2021
@cagedmantis
Contributor

Checking in on this as a release-blocking issue. Are there any updates?

@bcmills
Member Author

bcmills commented Jan 26, 2022

Still occurring regularly. I looked at some of the failures in more detail, but couldn't spot any particular pattern, beyond that it's always just the one builder (always 11_0, never 12_0). The staleness check triggers partway through running tests, but not at any consistent point during the build and not for any one particular package.

I almost wonder if the packages are being reported as stale because something is being corrupted — either the source code in GOROOT/src, or the precompiled libraries in GOROOT/pkg. (#50706 is somewhat similar, but the corruption observed there is in the build cache rather than in GOROOT proper.)

Is it possible that the machine running this builder has a bad disk, or bad RAM?

greplogs --dashboard -md -l -e '(?ms)\Adarwin-arm64.*go tool dist: unexpected stale targets' --since=2021-12-09

2022-01-24T12:26:25-0ef6dd7/darwin-arm64-11_0-toothrot
2021-12-22T18:38:41-d7b035f/darwin-arm64-11_0-toothrot
2021-12-21T03:55:43-2d1d548/darwin-arm64-11_0-toothrot
2021-12-21T01:10:17-e087949/darwin-arm64-11_0-toothrot
2021-12-15T20:26:03-bc0aba9/darwin-arm64-11_0-toothrot

@bcmills
Member Author

bcmills commented Jan 28, 2022

Checking in on this as a release-blocking issue.

Looking at the longer history in #49692 (comment), it appears that this failure mode has existed ever since this builder was turned up. The builder was defined in dashboard/builders.go in CL 278432 on Dec. 15, 2020, and the first failure in the logs was Jan. 1, 2021, only a couple of weeks later.

According to the Go Porting Policy, ‘Any port started during a release cycle must be finished (all.bash passing, builder reporting "ok") before the corresponding release freeze, or else the code will be removed at the freeze.’ By my reading of that policy, we technically should not have even accepted the darwin/arm64 port into the release with the builder still not passing reliably; however, the port was fast-tracked due to the sudden availability of ARM-based macOS devices (#41385 (comment)), and the novel failure mode for the new port was masked by its similarity to the existing failure mode from #33598. (Tellingly, #41385 (comment) described it as “pretty much on parity with the darwin/amd64 port”, not “passing reliably”.)

As with #39349 (comment), since this failure mode does not seem to be a regression in Go 1.18 proper, I would be ok with moving it to the Go1.19 milestone. However, since the port technically should not have been accepted in the first place in this state — and especially since this builder is for a first-class port (#43814)! — I believe this should be treated as a release-blocker for Go 1.19 and not deferred past that release.

@toothrot modified the milestones: Go1.18, Go1.19 Feb 2, 2022
@bcmills
Member Author

bcmills commented Feb 15, 2022

Here's a wacky variation on this failure mode:

go tool dist: FAILED: go list -gcflags=all= -ldflags=all= -f={{if .Stale}}	STALE {{.ImportPath}}: {{.StaleReason}}{{end}} std: signal: illegal instruction

greplogs --dashboard -md -l -e 'go tool dist: .* std: signal: illegal instruction'

2022-02-11T19:36:36-0bde2cf/darwin-arm64-11_0-toothrot
2021-04-16T21:19:23-9e8a312/darwin-arm64-11_0-toothrot
2021-03-23T01:21:24-d25476e/darwin-arm64-11_0-toothrot

That seems to point to memory corruption on the builder during the staleness check. I wonder whether this is a kernel bug, or maybe bad RAM on the builder machine.

@bcmills
Member Author

bcmills commented Mar 8, 2022

Another symptom that seems to point toward memory corruption on this builder:

greplogs --dashboard -md -l -e '(?ms)\Adarwin-.*runtime: marked free object in span'

2022-03-08T02:01:53-31be628/darwin-arm64-11_0-toothrot

@toothrot
Contributor

I've recently updated the OS by several minor versions (I believe from 11.0 to 11.6.5) in #51851, which may affect this issue.

@bcmills
Member Author

bcmills commented Apr 22, 2022

The aforementioned OS updates on 2022-03-22 appear to have worked! Closing until / unless we see this again.

greplogs --dashboard -md -l -e '(?ms)\Adarwin-arm64.*go tool dist: unexpected stale targets' --since=2022-03-21

2022-03-21T18:58:42-79103fa/darwin-arm64-11_0-toothrot

greplogs --dashboard -md -l -e 'go tool dist: .* std: signal: illegal instruction' --since=2022-03-21

[no results]

greplogs --dashboard -md -l -e '(?ms)\Adarwin-.*runtime: marked free object in span' --since=2022-03-21

[no results]

@bcmills closed this as completed Apr 22, 2022
@dmitshur added the NeedsFix label and removed the NeedsInvestigation label Apr 22, 2022
@golang locked and limited conversation to collaborators Jun 23, 2023