prevent the "Expect(err).NotTo(HaveOccurred())" and "ExpectNoError(err)" anti-pattern in e2e tests #109600

pohly · 2022-04-21T19:17:37Z

What would you like to be added?

This was previously tracked in #34059 (created in 2016, closed in 2020 after several tests were enhanced).

I'm starting to wonder whether the problem is getting worse again because we still don't catch new tests that use this anti-pattern (for example, #107763).

Is it time to start another spring clean which removes occurrences of this anti-pattern? Or perhaps add some mechanism that prevents it in new code?

/sig testing
/area e2e-test-framework

Why is this needed?

I'm getting tired of seeing failures where the only output is

Expected error:
    <*errors.errorString | 0xc820412c10>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred

This is the result of

Expect(err).NotTo(HaveOccurred())

or more recently

ExpectNoError(err)

An explanation string would help, but even better solutions would be possible (#106575).

The text was updated successfully, but these errors were encountered:

k8s-ci-robot · 2022-04-21T19:17:43Z

@pohly: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pohly · 2022-04-21T19:27:50Z

One thing that we could do is add an empty string as explanation with a TODO comment to all current calls that don't have one and then make the explanation parameter(s) required in

kubernetes/test/e2e/framework/expect.go

Lines 38 to 41 in f173d01

    
           // ExpectNoError checks if "err" is set, and if so, fails assertion while logging the error. 
        
           func ExpectNoError(err error, explain ...interface{}) { 
        
           	ExpectNoErrorWithOffset(1, err, explain...) 
        
           }

Code without an explanation then would look like this:

ExpectNoError(err, "" /* TODO: add more details about the error, see https://github.com/kubernetes/kubernetes/issues/109600*/)

pohly · 2022-04-21T19:30:27Z

The "TODO" could be removed in places where the error is known to be informative enough. The problem is that this can only be determined by examining the code. It also might not be obvious. I'm leaning to always requiring a non-empty string, everything else seems rather fragile.

tallclair · 2022-04-22T18:16:23Z

+1 for requiring a non-empty string.

I suggest we update all the functions in kubernetes/test/e2e/framework/expect.go to make adding some context non-optional, for example:

// ExpectNoError checks if "err" is set, and if so, fails assertion while logging the error.
// context should describe where the error came from, such as the intent of the call that produced the error.
func ExpectNoError(err error, contextMsg string, explain ...interface{}) { 
 	if contextMsg == "" { /* fail */ }
 	explain = append([]interface{}{contextMsg}, explain...)
 	ExpectNoErrorWithOffset(1, err, explain...) 
 }

As a first pass, callers could just be updated to set the contextMsg to "TODO"

AxeZhan · 2022-04-23T13:47:14Z

Hey guys, can I try to fix this one?
I'm going to change ExpectNoError like @tallclair said

// ExpectNoError checks if "err" is set, and if so, fails assertion while logging the error.
func ExpectNoError(err error, contextMsg string, explain ...interface{}) {
	if len(contextMsg) == 0 {
		contextMsg = "Add more details about the error, see https://github.com/kubernetes/kubernetes/issues/109600"
	}
	ExpectNoErrorWithOffset(1, err, append([]interface{}{contextMsg}, explain...))
}

and for those calls without contextMsg, I can temporarily replace it with an empty string,like
framework.ExpectNoError(err,"")

so it can print the message about this issue,and it doesn't affect compilation

pohly · 2022-04-25T06:26:00Z

I would just continue to call it explain instead of contextMsg because that is what the parameter was called before. We are not introducing a new concept, we just make the existing one mandatory. One thing that should be clarified on this occasion is that "explain" is a format string. The framework wrappers don't mention this at the moment.

How does this sound?

// ExpectNoError checks if "err" is set, and if so, fails the assertion while logging the error.
//
// An additional printf-style explanation must be provided because often errors
// cannot be understood without it. In cases where the error is known to
// be descriptive enough in all cases, an empty explainFormatStr can be
// passed.
func ExpectNoError(err error, explainFormatStr string, explainArgs ...interface{})

It's worth calling out that we loose some functionality when doing this: previously, explain could also be a func() string which was called only if needed. But I'm pretty sure that this isn't known widely enough that it got used much, if at all, and it shouldn't be a big loss if we don't support it. If someone wants it, they can still call Gomega directly.

I agree that just passing "TODO" sounds like a good idea.

Do we want to do this in one PR or in multiple? If we do it in one, we pretty much need a root approver, otherwise it will be blocked until all the various tests owners add their approval. It could easily get stuck in rebase hell. If we split it up along ownership boundaries, we still need lots of approvals, but might have to rebase less. I'd prefer a single PR that goes in early in the 1.25 cycle, if we can get that through.

@kidddddddddddddddddddddd: you are welcome to work on this, but let's first clarify exactly what we want and how to do it.

AxeZhan · 2022-04-25T07:02:28Z

I would just continue to call it explain instead of contextMsg because that is what the parameter was called before. We are not introducing a new concept, we just make the existing one mandatory. One thing that should be clarified on this occasion is that "explain" is a format string. The framework wrappers don't mention this at the moment.

How does this sound?
// ExpectNoError checks if "err" is set, and if so, fails the assertion while logging the error.
//
// An additional printf-style explanation must be provided because often errors
// cannot be understood without it. In cases where the error is known to
// be descriptive enough in all cases, an empty explainFormatStr can be
// passed.
func ExpectNoError(err error, explainFormatStr string, explainArgs ...interface{})
It's worth calling out that we loose some functionality when doing this: previously, explain could also be a func() string which was called only if needed. But I'm pretty sure that this isn't known widely enough that it got used much, if at all, and it shouldn't be a big loss if we don't support it. If someone wants it, they can still call Gomega directly.

I agree that just passing "TODO" sounds like a good idea.

Do we want to do this in one PR or in multiple? If we do it in one, we pretty much need a root approver, otherwise it will be blocked until all the various tests owners add their approval. It could easily get stuck in rebase hell. If we split it up along ownership boundaries, we still need lots of approvals, but might have to rebase less. I'd prefer a single PR that goes in early in the 1.25 cycle, if we can get that through.

@kidddddddddddddddddddddd: you are welcome to work on this, but let's first clarify exactly what we want and how to do it.

I also prefer a single PR.
But how do we modify the current function calls.Should we just add an empty explainFormatStr string or add a "TODO". Like you said, in cases, an empty string can be passed.
I think we can pass a "TODO" as explainFormatStr for all those calls without explain, but this will result in more cleanup later.
Or we can add some simple explanation for every call, but this will make the review harder.

pohly · 2022-04-25T07:31:36Z

But how do we modify the current function calls.Should we just add an empty explainFormatStr string or add a "TODO".

Such a single PR must be very simple, otherwise we get bogged down in discussions around what individual strings should be. Therefore explainFormatStr = "TODO" makes sense to me. Later code owners can and should spend some of their own time on addressing the TODOs in their tests.

AxeZhan · 2022-04-25T07:59:36Z

Such a single PR must be very simple, otherwise we get bogged down in discussions around what individual strings should be. Therefore explainFormatStr = "TODO" makes sense to me. Later code owners can and should spend some of their own time on addressing the TODOs in their tests.

Sounds good to me,I'll make a simple PR in a day or two.

AxeZhan · 2022-04-25T08:00:41Z

/assign

k8s-triage-robot · 2022-07-24T08:31:50Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pohly · 2022-08-01T14:05:58Z

/remove-lifecycle stale

k8s-triage-robot · 2022-10-30T14:40:02Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pohly · 2022-10-31T07:04:20Z

/remove-lifecycle stale

tallclair · 2022-11-01T16:30:50Z

/lifecycle frozen

pohly · 2022-12-19T10:08:32Z

We are not done with this and it is still a problem. The number one error on https://storage.googleapis.com/k8s-triage/index.html right now is:

Dec  4 07:05:39.498: timed out waiting for the condition
test/e2e/framework/framework.go:241
k8s.io/kubernetes/test/e2e/framework.(*Framework).BeforeEach(0xc0012dc5a0)
	test/e2e/framework/framework.go:241 +0x96f

kubernetes/test/e2e/framework/framework.go

Lines 247 to 248 in cbd16cf

    
           err = WaitForDefaultServiceAccountInNamespace(ctx, f.ClientSet, namespace.Name) 
        
           ExpectNoError(err)

danwinship · 2023-01-30T14:35:06Z

So I suggested on #109728 that this is not an ExpectNoError() problem, it's a utilwait problem. Every time someone complains about error messages with no context, they are complaining about "timed out waiting for the condition". Forcing people to add additional context to other ExpectNoError calls just results in redundant error messages. So let's just redo the utilwait methods. Eg, add a required "waitingFor string" argument, so it can return fmt.Errorf("timed out waiting for %s", waitingFor). Or have the functions return (bool, error) instead of error, where false, nil means "timed out" and the caller then has to construct their own contextually-relevant error message. (If we want better error messages, banning fmt.Errorf("constant string") seems better than banning ExpectNoError(err).)

Changing the API would also provide an opportunity to add a way to say "if you time out, return the last error returned from the ConditionFunc rather than a timeout error", which is pretty common.

Maybe they could all take an input struct like Backoff which would allow squashing a lot of the variants down into fewer functions.

pohly · 2023-01-30T16:18:01Z

So let's just redo the utilwait methods.

I agree that this must be the first step. I wrote down some thoughts about what polling should support here:
kubernetes/community#7021

I implemented what I envision to meet these requirements here:
#113298

But I still think that adding additional information in ExpectNoError() can be useful. For example, consider a test which calls the same helper for two different pods:

framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(..., pod1, ...))
framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(..., pod2, ...))

Even if the improved error now includes the pod name, test failures become better when explaining what the pod is for:

framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(..., pod1, ...), "server pod")
framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(..., pod2, ...), "client pod")

I'm coming to the conclusion that there's no technical solution for this: test authors and reviewers will have to consider manually whether it should be called with an additional explanation.

pohly added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 21, 2022

k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework labels Apr 21, 2022

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 21, 2022

k8s-ci-robot assigned AxeZhan Apr 25, 2022

pohly mentioned this issue Apr 25, 2022

e2e: To exercise deletion of PV prior to deletion of PVC for a Bound PV-PVC pair #108872

Closed

AxeZhan mentioned this issue Apr 25, 2022

Make ExpectNoError explainable #109652

Closed

pohly mentioned this issue May 2, 2022

verify-golangci-lint.sh: support stricter checking in new code #109728

Merged

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 24, 2022

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2022

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2022

k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Nov 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

prevent the "Expect(err).NotTo(HaveOccurred())" and "ExpectNoError(err)" anti-pattern in e2e tests #109600

prevent the "Expect(err).NotTo(HaveOccurred())" and "ExpectNoError(err)" anti-pattern in e2e tests #109600

pohly commented Apr 21, 2022

k8s-ci-robot commented Apr 21, 2022

pohly commented Apr 21, 2022

pohly commented Apr 21, 2022

tallclair commented Apr 22, 2022

AxeZhan commented Apr 23, 2022

pohly commented Apr 25, 2022

AxeZhan commented Apr 25, 2022

pohly commented Apr 25, 2022

AxeZhan commented Apr 25, 2022

AxeZhan commented Apr 25, 2022

k8s-triage-robot commented Jul 24, 2022

pohly commented Aug 1, 2022

k8s-triage-robot commented Oct 30, 2022

pohly commented Oct 31, 2022

tallclair commented Nov 1, 2022

pohly commented Dec 19, 2022

danwinship commented Jan 30, 2023

pohly commented Jan 30, 2023

prevent the "Expect(err).NotTo(HaveOccurred())" and "ExpectNoError(err)" anti-pattern in e2e tests #109600

prevent the "Expect(err).NotTo(HaveOccurred())" and "ExpectNoError(err)" anti-pattern in e2e tests #109600

Comments

pohly commented Apr 21, 2022

What would you like to be added?

Why is this needed?

k8s-ci-robot commented Apr 21, 2022

pohly commented Apr 21, 2022

pohly commented Apr 21, 2022

tallclair commented Apr 22, 2022

AxeZhan commented Apr 23, 2022

pohly commented Apr 25, 2022

AxeZhan commented Apr 25, 2022

pohly commented Apr 25, 2022

AxeZhan commented Apr 25, 2022

AxeZhan commented Apr 25, 2022

k8s-triage-robot commented Jul 24, 2022

pohly commented Aug 1, 2022

k8s-triage-robot commented Oct 30, 2022

pohly commented Oct 31, 2022

tallclair commented Nov 1, 2022

pohly commented Dec 19, 2022

danwinship commented Jan 30, 2023

pohly commented Jan 30, 2023