Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent the "Expect(err).NotTo(HaveOccurred())" and "ExpectNoError(err)" anti-pattern in e2e tests #109600

Open
pohly opened this issue Apr 21, 2022 · 18 comments
Assignees
Labels
area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/testing Categorizes an issue or PR as relevant to SIG Testing.

Comments

@pohly
Copy link
Contributor

pohly commented Apr 21, 2022

What would you like to be added?

This was previously tracked in #34059 (created in 2016, closed in 2020 after several tests were enhanced).

I'm starting to wonder whether the problem is getting worse again because we still don't catch new tests that use this anti-pattern (for example, #107763).

Is it time to start another spring clean which removes occurrences of this anti-pattern? Or perhaps add some mechanism that prevents it in new code?

/sig testing
/area e2e-test-framework

Why is this needed?

I'm getting tired of seeing failures where the only output is

Expected error:
    <*errors.errorString | 0xc820412c10>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred

This is the result of

Expect(err).NotTo(HaveOccurred())

or more recently

ExpectNoError(err)

An explanation string would help, but even better solutions would be possible (#106575).

@pohly pohly added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 21, 2022
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework labels Apr 21, 2022
@k8s-ci-robot
Copy link
Contributor

@pohly: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 21, 2022
@pohly
Copy link
Contributor Author

pohly commented Apr 21, 2022

One thing that we could do is add an empty string as explanation with a TODO comment to all current calls that don't have one and then make the explanation parameter(s) required in

// ExpectNoError checks if "err" is set, and if so, fails assertion while logging the error.
func ExpectNoError(err error, explain ...interface{}) {
ExpectNoErrorWithOffset(1, err, explain...)
}

Code without an explanation then would look like this:

ExpectNoError(err, "" /* TODO: add more details about the error, see https://github.com/kubernetes/kubernetes/issues/109600*/)

@pohly
Copy link
Contributor Author

pohly commented Apr 21, 2022

The "TODO" could be removed in places where the error is known to be informative enough. The problem is that this can only be determined by examining the code. It also might not be obvious. I'm leaning to always requiring a non-empty string, everything else seems rather fragile.

@tallclair
Copy link
Member

+1 for requiring a non-empty string.

I suggest we update all the functions in kubernetes/test/e2e/framework/expect.go to make adding some context non-optional, for example:

// ExpectNoError checks if "err" is set, and if so, fails assertion while logging the error.
// context should describe where the error came from, such as the intent of the call that produced the error.
func ExpectNoError(err error, contextMsg string, explain ...interface{}) { 
 	if contextMsg == "" { /* fail */ }
 	explain = append([]interface{}{contextMsg}, explain...)
 	ExpectNoErrorWithOffset(1, err, explain...) 
 }

As a first pass, callers could just be updated to set the contextMsg to "TODO"

@AxeZhan
Copy link
Member

AxeZhan commented Apr 23, 2022

Hey guys, can I try to fix this one?
I'm going to change ExpectNoError like @tallclair said

// ExpectNoError checks if "err" is set, and if so, fails assertion while logging the error.
func ExpectNoError(err error, contextMsg string, explain ...interface{}) {
	if len(contextMsg) == 0 {
		contextMsg = "Add more details about the error, see https://github.com/kubernetes/kubernetes/issues/109600"
	}
	ExpectNoErrorWithOffset(1, err, append([]interface{}{contextMsg}, explain...))
}

and for those calls without contextMsg, I can temporarily replace it with an empty string,like
framework.ExpectNoError(err,"")

so it can print the message about this issue,and it doesn't affect compilation

@pohly
Copy link
Contributor Author

pohly commented Apr 25, 2022

I would just continue to call it explain instead of contextMsg because that is what the parameter was called before. We are not introducing a new concept, we just make the existing one mandatory. One thing that should be clarified on this occasion is that "explain" is a format string. The framework wrappers don't mention this at the moment.

How does this sound?

// ExpectNoError checks if "err" is set, and if so, fails the assertion while logging the error.
//
// An additional printf-style explanation must be provided because often errors
// cannot be understood without it. In cases where the error is known to
// be descriptive enough in all cases, an empty explainFormatStr can be
// passed.
func ExpectNoError(err error, explainFormatStr string, explainArgs ...interface{})

It's worth calling out that we loose some functionality when doing this: previously, explain could also be a func() string which was called only if needed. But I'm pretty sure that this isn't known widely enough that it got used much, if at all, and it shouldn't be a big loss if we don't support it. If someone wants it, they can still call Gomega directly.

I agree that just passing "TODO" sounds like a good idea.

Do we want to do this in one PR or in multiple? If we do it in one, we pretty much need a root approver, otherwise it will be blocked until all the various tests owners add their approval. It could easily get stuck in rebase hell. If we split it up along ownership boundaries, we still need lots of approvals, but might have to rebase less. I'd prefer a single PR that goes in early in the 1.25 cycle, if we can get that through.

@kidddddddddddddddddddddd: you are welcome to work on this, but let's first clarify exactly what we want and how to do it.

@AxeZhan
Copy link
Member

AxeZhan commented Apr 25, 2022

I would just continue to call it explain instead of contextMsg because that is what the parameter was called before. We are not introducing a new concept, we just make the existing one mandatory. One thing that should be clarified on this occasion is that "explain" is a format string. The framework wrappers don't mention this at the moment.

How does this sound?

// ExpectNoError checks if "err" is set, and if so, fails the assertion while logging the error.
//
// An additional printf-style explanation must be provided because often errors
// cannot be understood without it. In cases where the error is known to
// be descriptive enough in all cases, an empty explainFormatStr can be
// passed.
func ExpectNoError(err error, explainFormatStr string, explainArgs ...interface{})

It's worth calling out that we loose some functionality when doing this: previously, explain could also be a func() string which was called only if needed. But I'm pretty sure that this isn't known widely enough that it got used much, if at all, and it shouldn't be a big loss if we don't support it. If someone wants it, they can still call Gomega directly.

I agree that just passing "TODO" sounds like a good idea.

Do we want to do this in one PR or in multiple? If we do it in one, we pretty much need a root approver, otherwise it will be blocked until all the various tests owners add their approval. It could easily get stuck in rebase hell. If we split it up along ownership boundaries, we still need lots of approvals, but might have to rebase less. I'd prefer a single PR that goes in early in the 1.25 cycle, if we can get that through.

@kidddddddddddddddddddddd: you are welcome to work on this, but let's first clarify exactly what we want and how to do it.

I also prefer a single PR.
But how do we modify the current function calls.Should we just add an empty explainFormatStr string or add a "TODO". Like you said, in cases, an empty string can be passed.
I think we can pass a "TODO" as explainFormatStr for all those calls without explain, but this will result in more cleanup later.
Or we can add some simple explanation for every call, but this will make the review harder.

@pohly
Copy link
Contributor Author

pohly commented Apr 25, 2022

But how do we modify the current function calls.Should we just add an empty explainFormatStr string or add a "TODO".

Such a single PR must be very simple, otherwise we get bogged down in discussions around what individual strings should be. Therefore explainFormatStr = "TODO" makes sense to me. Later code owners can and should spend some of their own time on addressing the TODOs in their tests.

@AxeZhan
Copy link
Member

AxeZhan commented Apr 25, 2022

Such a single PR must be very simple, otherwise we get bogged down in discussions around what individual strings should be. Therefore explainFormatStr = "TODO" makes sense to me. Later code owners can and should spend some of their own time on addressing the TODOs in their tests.

Sounds good to me,I'll make a simple PR in a day or two.

@AxeZhan
Copy link
Member

AxeZhan commented Apr 25, 2022

/assign

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 24, 2022
@pohly
Copy link
Contributor Author

pohly commented Aug 1, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2022
@pohly
Copy link
Contributor Author

pohly commented Oct 31, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2022
@tallclair
Copy link
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Nov 1, 2022
@pohly
Copy link
Contributor Author

pohly commented Dec 19, 2022

We are not done with this and it is still a problem. The number one error on https://storage.googleapis.com/k8s-triage/index.html right now is:

Dec  4 07:05:39.498: timed out waiting for the condition
test/e2e/framework/framework.go:241
k8s.io/kubernetes/test/e2e/framework.(*Framework).BeforeEach(0xc0012dc5a0)
	test/e2e/framework/framework.go:241 +0x96f

err = WaitForDefaultServiceAccountInNamespace(ctx, f.ClientSet, namespace.Name)
ExpectNoError(err)

@danwinship
Copy link
Contributor

So I suggested on #109728 that this is not an ExpectNoError() problem, it's a utilwait problem. Every time someone complains about error messages with no context, they are complaining about "timed out waiting for the condition". Forcing people to add additional context to other ExpectNoError calls just results in redundant error messages. So let's just redo the utilwait methods. Eg, add a required "waitingFor string" argument, so it can return fmt.Errorf("timed out waiting for %s", waitingFor). Or have the functions return (bool, error) instead of error, where false, nil means "timed out" and the caller then has to construct their own contextually-relevant error message. (If we want better error messages, banning fmt.Errorf("constant string") seems better than banning ExpectNoError(err).)

Changing the API would also provide an opportunity to add a way to say "if you time out, return the last error returned from the ConditionFunc rather than a timeout error", which is pretty common.

Maybe they could all take an input struct like Backoff which would allow squashing a lot of the variants down into fewer functions.

@pohly
Copy link
Contributor Author

pohly commented Jan 30, 2023

So let's just redo the utilwait methods.

I agree that this must be the first step. I wrote down some thoughts about what polling should support here:
kubernetes/community#7021

I implemented what I envision to meet these requirements here:
#113298

But I still think that adding additional information in ExpectNoError() can be useful. For example, consider a test which calls the same helper for two different pods:

framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(..., pod1, ...))
framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(..., pod2, ...))

Even if the improved error now includes the pod name, test failures become better when explaining what the pod is for:

framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(..., pod1, ...), "server pod")
framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(..., pod2, ...), "client pod")

I'm coming to the conclusion that there's no technical solution for this: test authors and reviewers will have to consider manually whether it should be called with an additional explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/testing Categorizes an issue or PR as relevant to SIG Testing.
Projects
None yet
Development

No branches or pull requests

6 participants