
x/build/cmd/coordinator: failures are sometimes missing error output #39349

Open
bcmills opened this issue Jun 1, 2020 · 20 comments
Labels: Builders, NeedsInvestigation, release-blocker, Testing
Comments

bcmills (Member) commented Jun 1, 2020

https://build.golang.org/log/dab786adef1a18622f61641285864ac9c63fb7e3 is marked with fail on the dashboard, but the word FAIL does not appear in the output file at all.

Either the output is truncated, or the last test in the log (misc/cgo/testshared) exited with a nonzero status and no output.

CC @dmitshur @toothrot @cagedmantis

@bcmills bcmills added Builders NeedsInvestigation labels Jun 1, 2020
@gopherbot gopherbot added this to the Unreleased milestone Jun 1, 2020
bcmills (Member, Author) commented Aug 27, 2021

The failures are still occurring, but they now have one extra line of output: a "signal: terminated" line at the end of the test output.

greplogs --dashboard -md -l -e '(?m)##### .*\n.*signal: terminated\n\z'

2021-08-27T01:08:12-c927599/linux-386-longtest
2021-08-26T17:19:04-166b691/linux-386-longtest
2021-06-02T17:34:43-e11d142/linux-386-longtest
2021-05-27T15:00:58-950fa11/linux-386-longtest
2021-04-28T18:50:41-ea65a12/linux-386-longtest

bcmills (Member, Author) commented Jan 5, 2022

greplogs --dashboard -md -l -e '(?m)##### .*\n.*signal: terminated\n\z' --since=2021-08-28

2022-01-04T22:27:20-f154f8b/linux-386-longtest
2022-01-04T21:10:15-1c8f9d2/linux-386-longtest
2021-12-23T20:03:38-b357b05/linux-386-longtest
2021-12-22T18:43:55-0f3becf/linux-386-longtest
2021-12-13T15:47:14-36db10f/linux-386-longtest
2021-12-07T00:58:10-0eb39ca/linux-386-longtest
2021-12-06T17:34:53-2cb9042/linux-386-longtest
2021-12-04T01:07:10-ba83aa7/linux-386-longtest
2021-10-18T16:32:23-4251541/linux-386-longtest
2021-10-18T16:06:08-417100e/linux-386-longtest
2021-10-13T17:02:43-b5904f3/linux-386-longtest
2021-10-13T16:41:20-aded167/linux-386-longtest
2021-10-13T15:38:39-b8e4df0/linux-386-longtest
2021-10-11T19:20:12-65ffee6/linux-386-longtest
2021-10-06T20:30:12-4a37a1d/linux-386-longtest
2021-10-06T20:29:59-8238f82/linux-386-longtest
2021-10-01T17:40:49-243d65c/linux-386-longtest
2021-09-21T20:29:00-f6f6621/linux-386-longtest
2021-09-17T23:44:10-4b654c0/linux-386-longtest
2021-09-17T23:19:17-163871f/linux-386-longtest
2021-09-17T18:33:15-70493b3/linux-386-longtest
2021-09-16T23:57:28-af9da13/linux-386-longtest
2021-09-15T03:29:43-738cebb/linux-386-longtest
2021-09-14T23:07:15-137543b/linux-386-longtest
2021-09-14T14:27:57-181e8cd/linux-386-longtest
2021-09-14T00:49:39-4a4221e/linux-386-longtest
2021-09-12T16:46:58-ad97d20/linux-386-longtest
2021-09-09T22:18:05-426ff37/linux-386-longtest
2021-09-08T14:25:12-64bdad2/linux-386-longtest
2021-09-08T06:38:19-da790cc/linux-386-longtest
2021-09-08T00:02:45-963218c/linux-386-longtest
2021-09-07T18:37:34-6640171/linux-386-longtest
2021-09-04T15:01:53-5c224ec/linux-386-longtest
2021-09-03T19:56:09-04d8d24/linux-386-longtest
2021-09-03T04:24:17-17910ed/linux-386-longtest
2021-08-30T20:22:43-8250141/linux-386-longtest
2021-08-28T00:51:39-5fb1771/linux-386-longtest

bcmills (Member, Author) commented Jan 5, 2022

This seems like a very high failure rate, and linux-386 is a first-class port. Marking as release-blocker.

Given that the recent failures are only on the 386-longtest builder, I suspect that this is due to some test running out of address space. Probably either the builder itself needs to be tuned or the ../test runner harness needs to reduce its parallelism on 32-bit platforms.

@bcmills bcmills modified the milestones: Unreleased, Go1.18 Jan 5, 2022
@bcmills bcmills added release-blocker Testing labels Jan 5, 2022
bcmills (Member, Author) commented Jan 5, 2022

I wonder if this could also just be a timing issue. Maybe the coordinator isn't giving the linux-386-longtest builder long enough to run all of the tests? (CC @golang/release)

aclements (Member) commented Jan 19, 2022

Some of these failures from the second comment could be because dist doesn't implement timeouts on ../test and some ../misc/cgo tests. That wouldn't explain truncation on any of the standard library tests, or the more recent pattern with "signal: terminated".

@toothrot toothrot added this to Planned in Go Release Team Jan 25, 2022
ianlancetaylor (Contributor) commented Jan 27, 2022

The failures all indicate that cmd/dist failed due to a SIGTERM signal. I don't understand where that signal is coming from. The Linux kernel will never send a SIGTERM to a process. I don't see any case in which the coordinator sends a SIGTERM.

It appears that Docker will send a SIGTERM signal when the container is being shut down. Perhaps that is what is happening here. There does seem to be a timeout for the instance, but the code is complicated enough that I don't understand how it is set. For these tests, the runtime test alone takes over 10 minutes, so the full run does take a while.

Is there any logging of timeouts done by the coordinator? I don't know this code myself.

I'm not sure this is a release blocker because it looks to me like an issue with the builder system rather than something that will be fixed in the release.

bcmills (Member, Author) commented Jan 28, 2022

> I'm not sure this is a release blocker because it looks to me like an issue with the builder system rather than something that will be fixed in the release.

My understanding from https://go.dev/wiki/PortingPolicy#first-class-ports is that first-class ports are expected to have passing tests at each release, and linux/386 is listed as a first-class port. I don't know of anyone running the linux/386 tests locally with any regularity, so if they aren't even reliably running to completion on the Go builders I don't think we can say with certainty that they are consistently passing.

The inflection point of the failure rate of this builder appears to have been around August 2020. Considering the state of the world in August 2020, I can understand why it wasn't addressed immediately at that time, but I do not know why it remained unaddressed at the time of the Go 1.17 release in 2021. I assume that it was overlooked because it lacked the release-blocker label, so when I was re-triaging the failures on the build dashboard I added that label to ensure that it would not be similarly overlooked for the current release.

bcmills (Member, Author) commented Jan 28, 2022

> it looks to me like an issue with the builder system rather than something that will be fixed in the release.

In addition to the interaction with the porting policy, to me this kind of issue is also a matter of equity.

I watch the builders to check whether there have been regressions in the parts of the project for which I am responsible, and failures in the linux-386-longtest builder matter to me: for one, many of the cmd/go tests only run on the longtest builders, and for two, many of the fuzzing tests I have reviewed this cycle have behaviors unique to the linux-386-longtest builder, because that builder has the unique combination of running non-short tests and having a non-amd64 GOARCH (which impacts fuzzing instrumentation).

So when I see failures on this builder, I check them. A significant rate of false-positive failures causes a significant amount of unproductive, avoidable triage work, and that in turn contributes to feelings of frustration and burnout. Since the Go project does not seem to have anyone else triaging new or existing builder failures with any regularity, I feel that the costs of this ongoing flakiness have been externalized onto me.

#33598 went through a trajectory very similar to this issue: we had a series of recurring failures on the builders for darwin/amd64, which is also nominally a first-class port. I identified a way to reliably reproduce the problem in March 2020, and the issue remained unaddressed until I diagnosed and fixed it myself in October 2021 (CL 353549).

#39665 was also similar: Dmitri reported longtest failures on windows/amd64 (also a first-class port) in June 2020, and no apparent progress was made on even a diagnosis until I reported a new failure mode in November 2021 (in #49457) and marked it as a release-blocker, at which point the underlying issue was apparently fixed.

If we consider subrepo tests, there are many more examples. As I understand it from #11811 the policy of the project is that subrepo tests should be passing before a release, but for at least the past couple of years we have cut releases with frequently- or persistently-broken subrepo builders. (Some examples: #45700, #31567, #36163; the latter on windows/amd64, which is a first-class port.)

My takeaway from those cases is that persistent builder issues generally will not be addressed unless I address them myself (as in #33598, #31567, and #36163), or they actively interfere with running x/build/cmd/releasebot, or they are explicitly marked with the release-blocker label.

Letting these kinds of persistent issues linger was understandable as a short-term situation in 2020, but it isn't tenable as a steady state for a large, staffed project. We all want to land releases, and our existing policy (at least as I understand it) is that to land a release we also need to keep the build healthy. Adhering to that policy provides some backpressure on the accumulation of unplanned technical debt, and helps to “internalize the externality” of flaky and broken builders.

ianlancetaylor (Contributor) commented Jan 28, 2022

I feel your pain, and many thanks for doing this unrewarding work.

Still, this seems like a process problem. I see no reason that an issue like this should block a release. If the linux-386-longtest builder failed 50% of the time then it would have to block the release because it might be hiding other problems. But that's not the case here; we are getting enough successful runs to have reason to believe that the build is basically OK.

You are pointing out that if these issues are not marked as release blockers, then they will never be addressed. That is the problem to fix. We shouldn't get ourselves into a position where we have to hold up a release because of a problem with the builders, where we have no reason to believe that the problem is or is hiding a problem with the release.

So I sympathize with the approach that you are taking, but I think we need to develop a different approach.

bcmills (Member, Author) commented Jan 28, 2022

> Still, this seems like a process problem.

I agree, but I think we already have a (documented but unenforced) solution to the process problem: namely, to align the incentives between shipping the release and maintaining a healthy build.

> We shouldn't get ourselves into a position where we have to hold up a release because of a problem with the builders, where we have no reason to believe that the problem is or is hiding a problem with the release.

I agree, but we could also avoid getting ourselves into that position by identifying and addressing new failures when they are introduced.

If the builders are consistently passing, then they will almost always remain so unless failures are introduced or exposed by specific code or configuration changes. If we respond to new failures immediately, we can more easily identify the likely causes, and either roll back or adapt to the changes that introduced them. As I understand it, the release freeze exists precisely to allow time for new problems to be addressed before the scheduled release.

For example, the spike in the linux-386-longtest failure rate in August 2020 was at essentially the same time that the tree opened for development on Go 1.16, with six failures that month. If we had identified the failures and started investigating in depth then, we would have still had five months of development and freeze remaining in which to diagnose and mitigate the problem before even a single release window passed.

For another example, the failure rate for the NetBSD and OpenBSD builders spiked during the current cycle (#49209 and related issues) due to a latent platform bug exposed by a configuration change in mid-October. The compiler & runtime team began diagnosing it in earnest in early November, the builders were reconfigured to diagnose the problem, and a configuration that works around the problem was identified on Nov. 29 (#49209 (comment)) — all within the normal release freeze window (if barely). The process worked, but I suspect that it only even started when it did because I tagged that issue release-blocker on Oct. 28.

If we follow the process and fix the problems as they are introduced, it doesn't need to hold up the release. The apparent tension between fixing the builds and cutting the release on time is, as far as I can tell, a consequence of putting off the build-fixes until the release is already looming.

bcmills (Member, Author) commented Jan 28, 2022

If we want to call a mulligan on the process problem, I suggest that we treat this issue as a release-blocker, but for Go 1.19 rather than 1.18.

I firmly believe that it should be a release-blocker for some release in the near future, because otherwise the incentives of the process are misaligned and we have no check on accumulating an unbounded amount of unplanned technical debt.

However, I agree that it does not indicate a regression specific to the 1.18 release, which is by all accounts a busy release cycle and major milestone and already delayed enough as it is.

Making it a release-blocker for Go 1.19 would allow adequate time to ship the 1.18 release, and provide an entire development window in which to investigate and mitigate this issue.

dmitshur (Contributor) commented Jan 28, 2022

In our team meeting this week, we discussed this and agreed that the next step is to move this to the 1.19 milestone, giving us a chance to prioritize investigation of this issue during the 1.19 dev cycle. To make progress here, we need to do development work on x/build/cmd/coordinator, and the 1.18 freeze isn't the right period for that work.

For the 1.18 freeze, if there are linux-386-longtest test failures happening at a rate higher than can be attributed to flakiness, we'll need to find ways to reproduce and understand them even if the build logs do not include as much information as we'd like.

Updating this issue to reflect the current reality; we can adjust it further as needed.

@dmitshur dmitshur modified the milestones: Go1.18, Go1.19 Jan 28, 2022
@dmitshur dmitshur added the early-in-cycle label Jan 28, 2022
@dmitshur dmitshur changed the title x/build: longtest failures missing error output x/build/cmd/coordinator: longtest failures are sometimes missing error output Jan 28, 2022
gopherbot commented Mar 16, 2022

This issue is currently labeled as early-in-cycle for Go 1.19.
That time is now, so a friendly reminder to look at it again.

@heschi heschi removed this from Planned in Go Release Team Mar 16, 2022
@dmitshur dmitshur added this to Planned in Go Release Team Mar 29, 2022
bcmills (Member, Author) commented Mar 30, 2022

It appears that the regexp I used in #39349 (comment) was under-counting: the signal: terminated message does not occur on Windows. Widening the regexp to account for that, we have:

greplogs --dashboard -md -l -e '(?m)##### .*\n(?:.*signal: terminated\n|\s*)\z' --since=2022-01-05

2022-03-21T18:58:42-79103fa/windows-386-2008
2022-01-28T22:21:43-25b4b86/linux-386-longtest
2022-01-27T00:03:31-f4aa021/linux-386-longtest
2022-01-24T12:26:25-0ef6dd7/linux-386-longtest
2022-01-07T17:55:52-98ed916/linux-386-longtest

bcmills (Member, Author) commented Apr 5, 2022

One more today:

greplogs --dashboard -md -l -e '(?m)##### .*\n(?:.*signal: terminated\n|\s*)\z' --since=2022-03-22

2022-04-05T09:26:01-a041a75/windows-386-2008

@bcmills bcmills changed the title x/build/cmd/coordinator: longtest failures are sometimes missing error output x/build/cmd/coordinator: failures are sometimes missing error output Apr 7, 2022
@dmitshur dmitshur removed the early-in-cycle label May 11, 2022
aclements (Member) commented May 31, 2022

Coordinator folks, what builder logs would we expect if cmd/dist itself got stuck? Does the buildlet or coordinator impose a timeout that would be logged, or would we expect the coordinator to just kill off the VM and the log to stop?

aclements (Member) commented May 31, 2022

@dmitshur says that until two weeks ago, if cmd/dist wedged, the builder would eventually get killed and the build retried. As of two weeks ago, it will instead be killed and clearly marked as timed out, which suggests this is not cmd/dist wedging.

dmitshur (Contributor) commented May 31, 2022

@aclements The relevant issue for the recent change is #42699 (comment). (Maybe it was 1 week ago instead of 2.)

@cagedmantis also pointed out that the occasional coordinator restarts may be contributing to this problem. Issue #51057 is related to that.
