x/build/cmd/coordinator: runSubrepoTests (subrepo tests) should also check maxTestExecErrors constant #36226

dmitshur opened this issue Dec 20, 2019 · 0 comments


@dmitshur dmitshur commented Dec 20, 2019

In my investigation at #35581 (comment), I wrote:

It is intentional to keep retrying "communications failures" forever, because the expectation is that they should eventually succeed.

I'm seeing now that this isn't quite true. There is a constant defined:

// maxTestExecError is the number of test execution failures at which
// we give up and stop trying and instead permanently fail the test.
// Note that this is not related to whether the test failed remotely,
// but whether we were unable to start or complete watching it run.
// (A communication error)
const maxTestExecErrors = 3

The runTestsOnBuildlet method, which is called by the runTests method, has a block that checks whether ti.numFail has reached maxTestExecErrors:

if err != nil {
	bc.MarkBroken() // prevents reuse
	for _, ti := range tis {
		st.logf("Execution error running %s on %s: %v (numFails = %d)", ti.name, bc, err, ti.numFail)
		if err == buildlet.ErrTimeout {
			ti.failf("Test %q ran over %v limit (%v); saw output:\n%s", ti.name, timeout, execDuration, buf.Bytes())
		} else if ti.numFail >= maxTestExecErrors {
			ti.failf("Failed to schedule %q test after %d tries.\n", ti.name, maxTestExecErrors)
		} else {

However, the runTests method is only used for the main Go repository, not for subrepos:

if st.IsSubrepo() {
	remoteErr, err = st.runSubrepoTests()
} else {
	remoteErr, err = st.runTests(st.getHelpers())

So this bug is about making the subrepo path also check the maxTestExecErrors constant and give up after that many failed tries.

It's low value to fix because we rarely run into a situation where communication errors happen 3 times or more; that happens most often due to other bugs which we need to fix anyway.

/cc @bradfitz @cagedmantis @toothrot
