New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[E2E] enhance pod polling failures #109828
Conversation
@pohly: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
It's ironic that a PR about polling runs into this (unrelated) flake (https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109828/pull-kubernetes-e2e-gce-ubuntu-containerd/1522227643875856384/):
|
Another failure is also instructive (https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109828/pull-kubernetes-conformance-kind-ipv6-parallel/1522227643787776000):
Here the simplified failure printing isn't done because the error isn't getting wrapped properly: kubernetes/test/e2e/framework/util.go Line 873 in 751e9c0
Should we do a search/replace of |
5be20d0
to
ca4ccd6
Compare
Another sub-optimal timeout failure (https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109828/pull-kubernetes-conformance-kind-ga-only-parallel/1522304400956067840):
The problem in test/e2e/framework/pods.go is two-fold:
Does anybody remember what the purpose of these Gomega wrapper functions are? Are they meant to be syntactic sugar and thus optional or are they part of some plan to get rid of Gomega, i.e. should be used everywhere? Now an argument could be made for requiring the use of the framework wrapper functions (better failure message handling). @Random-Liu: you have one TODO on PodClient.WaitForSuccess, the function that failed here. Any comments? |
Thockin tried with https://github.com/quasilyte/go-ruleguard but it needs to import an external module just for linting 🤷 |
If we limit that dependency to But first we need to decide what the rule should be - always wrap the underlying error? That is not necessarily always the right approach, at least not outside of test/e2e. |
func (e *timeoutError) Is(target error) bool { | ||
return target == ErrFailureMsg | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right to me. Every ErrFailureMsg
is a timeout error, but a timeout error isn't a timeout error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this says is "a timeout error is a ErrFailureMsg" and thus gets treated like one by ExpectNoErrorWithOffset
.
test/e2e/framework/expect.go
Outdated
@@ -43,6 +47,18 @@ func ExpectNoError(err error, explain ...interface{}) { | |||
// ExpectNoErrorWithOffset checks if "err" is set, and if so, fails assertion while logging the error at "offset" levels above its caller | |||
// (for example, for call chain f -> g -> ExpectNoErrorWithOffset(1, ...) error would be logged for "f"). | |||
func ExpectNoErrorWithOffset(offset int, err error, explain ...interface{}) { | |||
if err != nil && | |||
errors.Is(err, e2eerrors.ErrFailureMsg) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I agree with this direction. I think in most cases this is the behavior we want.
What if instead, we always only fail with the actual error message (what you have here), but we always log the raw error detail, so the information is available if you look for it, but doesn't clutter the top-level failure messages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works for me. I didn't want to go that far because I wasn't sure whether it would reduce usability in other cases, but logging the error inside the test output should prevent that.
test/e2e/framework/pods.go
Outdated
@@ -69,10 +69,22 @@ var ImagePrePullList sets.String | |||
// PodClient is a convenience method for getting a pod client interface in the framework's namespace, | |||
// possibly applying test-suite specific transformations to the pod spec, e.g. for | |||
// node e2e pod scheduling. | |||
func (f *Framework) PodClient() *PodClient { | |||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: The changes in this file look unrelated to the others. I suggest moving these to a separate PR.
test/e2e/framework/pods.go
Outdated
skipFrames = []int{0} | ||
case 1: | ||
default: | ||
Failf("expect at most one skipFrames parameter, got: %v", skipFrames) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like using varargs in this way, since it doesn't surface errors at compile time. I'd prefer to create a second method that takes the argument.
test/e2e/framework/pods.go
Outdated
} | ||
|
||
func (c *PodClient) createSync(skipFrames int, pod *v1.Pod) *v1.Pod { | ||
skipFrames++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This skipFrames
stuff is a bit hard to follow, and seems fragile. What if instead you just iterate through the callers until you get one outside this file? e.g.
func (c *PodClient) callerOffset() int {
_, podClientFile, _, ok := runtime.Caller(0)
if !ok {
return c.skipFrames+2; // callerOffset() + caller
}
for i := 2; i < MaxDepth; i++ {
_, f, _, ok := runtime.Caller(i)
if f != podClientFile || !ok {
return c.skipFrames + i
}
}
// call stack too deep
return c.skipFrames + MaxDepth
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't work when the code in pods.go is called by some other helper function. Then we would log that helper function instead of the test code.
Skipping frames by counting them is a normal pattern for logging and also how Gomega does it. I agree that off-by-one errors are problematic, but I don't see a good alternative. The "testing" package allows helper functions to mark themselves as helper, but that only works in unit tests and for stack unwinding by the standard library.
test/e2e/framework/pods.go
Outdated
@@ -301,6 +326,6 @@ func (c *PodClient) MatchContainerOutput(name string, containerName string, expe | |||
// PodIsReady returns true if the specified pod is ready. Otherwise false. | |||
func (c *PodClient) PodIsReady(name string) bool { | |||
pod, err := c.Get(context.TODO(), name, metav1.GetOptions{}) | |||
ExpectNoError(err) | |||
ExpectNoErrorWithOffset(c.skipFrames+1, err, "get pod") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExpectNoErrorWithOffset(c.skipFrames+1, err, "get pod") | |
ExpectNoErrorWithOffset(c.skipFrames+1, err, "getting pod to check PodIsReady") |
I have rebased on top of master where unit tests in test/e2e get run. The PR now uses the same kind of testing as test/e2e/framework/log_test.go, with the common code split out into a helper package. I am starting with testing of the current pod/wait.go without any modifications. Even if we don't agree on the other changes, then that is still worth merging because it provides unit testing for the changes from PR #109704 The next commits then make changes to failure reporting. The effect can be seen in the diff for the new pod/wait_test.go. I think these changes are useful, but it is debatable. For example, the "INFO: Unexpected error" log message becomes longer and fairly unreadable when including more text in the error message. |
/retest |
test/e2e/framework/expect.go
Outdated
} | ||
} | ||
Fail(prefix+err.Error(), 1+offset) | ||
} | ||
gomega.ExpectWithOffset(1+offset, err).NotTo(gomega.HaveOccurred(), explain...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is gomega doing anything special under the hood here if err == nil
? Or should you just delete this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a leftover from the version where gomega was called for other, non-nil errors. Now that all non-nil errors are handled by our own code it can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
test/e2e/framework/pod/wait.go
Outdated
if IsTimeout(err) { | ||
e2elog.Logf("Timed out while waiting for pod %s to be %s. Last observed as: %s", | ||
if IsTimeout(err) && lastPod != nil { | ||
return TimeoutError("timed out while waiting for pod %s to be %s: last observed as:\n%s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does our test infrastructure aggregate failures? If anything is looking for identical failure messages, this will no longer match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will no longer match because the dumped pod state will always be different, right?
The first line will still be the same, so searching for that should still work. I don't know whether that is going to be enough. Let me ask on #sig-testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like failure text differences are already handled:
https://storage.googleapis.com/k8s-triage/index.html?text=timed%20out%20while%20waiting
351 test failures (44 today) look like [link](https://storage.googleapis.com/k8s-triage/index.html?text=timed%20out%20while%20waiting#32d66067bb5fe4ce2e3e)[search github](https://github.com/search?type=Issues&q=org:kubernetes%2032d66067bb5fe4ce2e3e)[file bug](https://storage.googleapis.com/k8s-triage/index.html?text=timed%20out%20while%20waiting#)storage
test/e2e/storage/nfs_persistent_volume-disruptive.go:153
May 5 11:08:28.112: Unexpected error:
<*errors.errorString | 0xc00654af50>: {
s: "pod \"pvc-tester-8c6qg\" is not Running: timed out while waiting for pod disruptive-pv-142/pvc-tester-8c6qg to be running",
}
pod "pvc-tester-8c6qg" is not Running: timed out while waiting for pod disruptive-pv-142/pvc-tester-8c6qg to be running
occurred
test/e2e/storage/nfs_persistent_volume-disruptive.go:185
Note that pod name and namespace already vary in each failure instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ben explained on Slack that this particular example works because varying strings like the 142 suffix get removed via regular expressions. Then remaining content must have <10% editing difference.
So including the pod could be a real problem. It could also help if the additional information let's the tool distinguish between different root causes for a failure, but I am a bit pessimistic about that.
Let's drop this part?
test/e2e/framework/pod/wait.go
Outdated
if IsTimeout(err) { | ||
e2elog.Logf("Timed out while waiting for pod %s to be %s. Last observed as: %s", | ||
if IsTimeout(err) && lastPod != nil { | ||
return TimeoutError("timed out while waiting for pod %s to be %s: last observed as:\n%s", | ||
podIdentifier(ns, podName), conditionDesc, format.Object(lastPod, 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(no action required, maybe a follow-up) I've already found these pod logs useful for debugging tests, but they are quite large and noisy. I'm thinking about using a yaml serializer instead, so that omitEmpty fields aren't (which would significantly cut down on the output size).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be a useful enhancement for gomega/format.Object
: skip fields that are marked as optional for JSON or YAML.
Otherwise we'll be stuck between a rock and a hard place when choosing a formatter: YAML serializers typically ignore unexported fields, which is bad for error types. output.Object prints those, but doesn't skip.
/cc @onsi
test/e2e/framework/pod/wait_test.go
Outdated
INFO: Pod "pending-pod": Phase="", Reason="", readiness=false. Elapsed: <elapsed> | ||
INFO: Unexpected error: | ||
<*pod.timeoutError>: { | ||
msg: "timed out while waiting for pod default/pending-pod to be running: last observed as:\n <*v1.Pod>: {\n TypeMeta: {Kind: \"\", APIVersion: \"\"},\n ObjectMeta: {\n Name: \"pending-pod\",\n GenerateName: \"\",\n Namespace: \"default\",\n SelfLink: \"\",\n UID: \"\",\n ResourceVersion: \"\",\n Generation: 0,\n CreationTimestamp: {\n Time: 0001-01-01T00:00:00Z,\n },\n DeletionTimestamp: nil,\n DeletionGracePeriodSeconds: nil,\n Labels: nil,\n Annotations: nil,\n OwnerReferences: nil,\n Finalizers: nil,\n ManagedFields: nil,\n },\n Spec: {\n Volumes: nil,\n InitContainers: nil,\n Containers: nil,\n EphemeralContainers: nil,\n RestartPolicy: \"\",\n TerminationGracePeriodSeconds: nil,\n ActiveDeadlineSeconds: nil,\n DNSPolicy: \"\",\n NodeSelector: nil,\n ServiceAccountName: \"\",\n DeprecatedServiceAccount: \"\",\n AutomountServiceAccountToken: nil,\n NodeName: \"\",\n HostNetwork: false,\n HostPID: false,\n HostIPC: false,\n ShareProcessNamespace: nil,\n SecurityContext: nil,\n ImagePullSecrets: nil,\n Hostname: \"\",\n Subdomain: \"\",\n Affinity: nil,\n SchedulerName: \"\",\n Tolerations: nil,\n HostAliases: nil,\n PriorityClassName: \"\",\n Priority: nil,\n DNSConfig: nil,\n ReadinessGates: nil,\n RuntimeClassName: nil,\n EnableServiceLinks: nil,\n PreemptionPolicy: nil,\n Overhead: nil,\n TopologySpreadConstraints: nil,\n SetHostnameAsFQDN: nil,\n OS: nil,\n },\n Status: {\n Phase: \"\",\n Conditions: nil,\n Message: \"\",\n Reason: \"\",\n NominatedNodeName: \"\",\n HostIP: \"\",\n PodIP: \"\",\n PodIPs: nil,\n StartTime: nil,\n InitContainerStatuses: nil,\n ContainerStatuses: nil,\n QOSClass: \"\",\n EphemeralContainerStatuses: nil,\n },\n }", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤮
This direct error output is always going to be a little gross, but one way you could clean this up is to include the pod directly in the TimeoutError struct, and just dump it when error is called. for example:
type timeoutError struct {
msg string
lastObservedPod *v1.Pod // or runtimeObject if we decide to generalize this
}
func (e *timeoutError) Error() string {
if e.lastObservedPod == nil {
return e.msg
} else {
return fmt.Sprintf("%s: last observed as: %s", e.msg, format.Object(e.lastObservedPod, 1))
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. We could also make the timeout error generic and let it have a slice of objects that get stored as interface{}
.
Would it make sense to split out the timeout error into its own package? There are other framework packages which implement polling, they may need the same timeout error eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My last commit now uses that approach, but without including the extra information in the error message. I am not adding it there because of the discussion around aggregation of failure messages - I am too worried that this won't work anymore.
I think it is a slight improvement because the overall output is shorter.
The timeout error is generic and could be moved as-is into a separate package, but I haven't done that yet.
If I did, should I use test/e2e/framework/errors
or test/e2e/framework/e2eerrors
? The latter avoids having to rename the package on each import to avoid clashes with the standard errors
package, but it would be inconsistent because the other packages don't do that.
The double ee
is also unfortunate. Perhaps framework/e2etypes
?
f4e98bb
to
05f3080
Compare
@pohly: GitHub didn't allow me to request PR reviews from the following users: onsi. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
prefix = fmt.Sprintf("unexpected explain arguments, need format string: %v", explain) | ||
} | ||
} | ||
Fail(prefix+err.Error(), 1+offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a comment here as to why we're not using the default gomega assertions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
test/e2e/framework/log_test.go
Outdated
// | ||
// | ||
// | ||
// This must be line #43 (first line is #1). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does first line is #1
mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That my editor is counting from 1. I'm not sure whether all of them do (first line might also be #0). Hmm, compilers and linters all print line numbers. I probably was too cautious here, there is an established convention. Let me remove it.
test/e2e/framework/output/output.go
Outdated
limitations under the License. | ||
*/ | ||
|
||
package output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should this be in a package testing
to indicate it's for internal testing? I guess that's confusing since the whole thing is in the test tree...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or better, "internal"?
test/e2e/framework/pod/wait_test.go
Outdated
INFO: Ignoring NotFound error while getting pod default/no-such-pod | ||
INFO: Ignoring NotFound error while getting pod default/no-such-pod | ||
INFO: Ignoring NotFound error while getting pod default/no-such-pod | ||
INFO: Ignoring NotFound error while getting pod default/no-such-pod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the number of attempts stable enough (especially on over subscribed CI infrastructure) that this won't be flaky?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure about that myself. It polls every two seconds, so yes, I guess on a busy system it might skip one or more of these polls. I'm reducing those lines to one now.
test/e2e/framework/expect.go
Outdated
|
||
// Errors usually contain unexported fields. We have to use | ||
// a formatter here which can print those. | ||
Logf("Unexpected error:\n%s", format.Object(err, 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider including the explain text in this. Something like this?
Logf("Unexpected error:\n%s", format.Object(err, 1)) | |
Logf("Unexpected error: %s\n%s", prefix, format.Object(err, 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, added.
This makes it easier to see when comment lines must be added or removed.
The helper code will be useful for testing other parts of the e2e framework.
Both output during the test and the failure message are covered.
The failure message from Gomega was hard to read because explanation and error text were separated by the error dump. In many cases, that error dump doesn't add any relevant information. Now the error is dumped first as info message (just in case that it is relevant) and then a shorter failure message is created from explanation and error text.
The advantage is that the extra error information is guaranteed to be printed directly before the failure and we avoid one extra log line that would have to be correlated with the failure.
Under load we might skip some of the poll events, leading to less output. Now we expect only at least one output line for polling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: pohly, tallclair The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
<*errors.errorString>: { | ||
s: "an error with a long, useless description", | ||
} | ||
an error with a long, useless description | ||
occurred | ||
FAIL: hard-coded error: an error with a long, useless description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an improvement. But after discussing it further on #sig-testing, I think I'll try to make it even better:
FAIL: <short failure message>
<details>
In this case:
FAIL: Unexpected error: hard-coded error: an error with a long, useless description
<*errors.errorString>: {
s: "an error with a long, useless description",
}
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This makes pod timeout failures easier to read (less clutter) and more informative (last pod state is part of failure message).
It also adds tests for this particular framework behavior.
Does this PR introduce a user-facing change?
/cc @tallclair @aojea