
storage e2e: Add context to timeout errors #87168

Merged
merged 1 commit into kubernetes:master from misterikkit:timout-err on Jan 15, 2020

Conversation

@misterikkit (Contributor) commented Jan 13, 2020

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Many times an e2e test fails with an unexpected error,
"timed out waiting for the condition".

Useful information may be in the test logs, but debugging e2e test
failures will be much faster if we add context to errors when they
happen.

This change makes sure we add context to all errors returned from
helpers like wait.Poll().
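
For illustration only, the general shape of the change looks roughly like this (a sketch, not an excerpt from the diff; waitForPodRunning, podName, and podRunning are placeholder names):

package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPodRunning sketches the error-wrapping pattern applied throughout
// this PR: the caller of wait.PollImmediate wraps any returned error with
// what it was waiting for and for how long.
func waitForPodRunning(podName string, podRunning wait.ConditionFunc) error {
	waitErr := wait.PollImmediate(10*time.Second, 5*time.Minute, podRunning)
	if waitErr != nil {
		// Instead of the bare "timed out waiting for the condition",
		// say what we were waiting for and for how long.
		return fmt.Errorf("pod %s did not start running within 5m: %v", podName, waitErr)
	}
	return nil
}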

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@misterikkit (Contributor, Author) commented Jan 13, 2020

/sig storage

@msau42 (Member) commented Jan 13, 2020

/lgtm
/approve

@misterikkit (Contributor, Author) commented Jan 13, 2020

/hold
/assign @davidz627

@davidz627 (Contributor) left a comment

Thanks for spending the time on this. It'll make tests way easier to debug in the future. Where did 12 comments come from? I don't know. But they're all minor.

	if waitErr != nil {
		return fmt.Errorf("error waiting for volume %v to attach to node %v: %v", volumeName, nodeName, waitErr)
	}
	return nil
}

func waitForVolumesInUse(client clientset.Interface, nodeName, volumeName string) error {

@davidz627 (Contributor), Jan 13, 2020

nit: waitForVolumesInUse and waitForVolumesNotInUse could technically be consolidated into waitForVolumeUsage(inUse bool). Though I guess that could be a little confusing depending on how you word/interpret it.

@misterikkit (Author, Contributor), Jan 14, 2020

You're right that these could be consolidated, and waitForVolumesAttached is also nearly identical. Making sure the code is still readable would probably mean passing in some predicate function and creating more helpers for checking things like volumeIsAttached(volumeName string, node *v1.Node) bool

That seems a little on the heavy side to clean up a little duplication here. It would probably not cause problems since this code changes rarely?
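
For concreteness, the heavier predicate-based version would look something like this (a hypothetical sketch, not code from this PR; waitForVolumeCondition, volumeIsAttached, and volumeIsInUse are illustrative names, and the Nodes().Get call assumes the pre-1.18 client-go signature without a context argument):

package example

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	clientset "k8s.io/client-go/kubernetes"
)

// volumeIsAttached and volumeIsInUse are the kind of predicates discussed above.
func volumeIsAttached(volumeName string, node *v1.Node) bool {
	for _, va := range node.Status.VolumesAttached {
		if string(va.Name) == volumeName {
			return true
		}
	}
	return false
}

func volumeIsInUse(volumeName string, node *v1.Node) bool {
	for _, v := range node.Status.VolumesInUse {
		if string(v) == volumeName {
			return true
		}
	}
	return false
}

// waitForVolumeCondition consolidates the near-identical helpers by taking a
// predicate, at the cost of one more layer of indirection.
func waitForVolumeCondition(client clientset.Interface, nodeName, volumeName, condition string,
	pred func(string, *v1.Node) bool) error {
	waitErr := wait.Poll(10*time.Second, 5*time.Minute, func() (bool, error) {
		node, err := client.CoreV1().Nodes().Get(nodeName, metav1.GetOptions{})
		if err != nil {
			return false, fmt.Errorf("error getting node %s: %v", nodeName, err)
		}
		return pred(volumeName, node), nil
	})
	if waitErr != nil {
		return fmt.Errorf("error waiting for volume %v to be %s on node %v: %v", volumeName, condition, nodeName, waitErr)
	}
	return nil
}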

@davidz627 (Contributor), Jan 14, 2020

Let's leave it be then :)

@@ -179,7 +183,10 @@ func waitForDeploymentToRecreatePod(client clientset.Interface, deployment *apps
return false, conditions.ErrPodCompleted
}
}
-	return false, err
+	return false, nil

@davidz627 (Contributor), Jan 13, 2020

This is a functional change. It looks correct though; I'm surprised the original was not manifesting in any bugs.

@misterikkit (Author, Contributor), Jan 14, 2020

I think the call sites are good about not reading the return value if err != nil.
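
For background, wait.Poll treats the two return values differently, which is why the distinction between (false, err) and (false, nil) matters; a minimal sketch:

package example

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// demoConditionSemantics illustrates how wait.Poll interprets the two return
// values: (false, nil) means "not done yet, keep polling"; a non-nil error
// aborts the poll immediately and is returned to the caller.
func demoConditionSemantics() {
	attempts := 0
	err := wait.Poll(10*time.Millisecond, time.Second, func() (bool, error) {
		attempts++
		if attempts < 3 {
			return false, nil // keep polling
		}
		return true, nil // done
	})
	fmt.Println(err) // <nil> after three attempts

	err = wait.Poll(10*time.Millisecond, time.Second, func() (bool, error) {
		return false, errors.New("boom") // aborts the poll on the first call
	})
	fmt.Println(err) // boom
}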

@@ -570,7 +570,8 @@ var _ = utils.SIGDescribe("PersistentVolumes-local ", func() {
}()

ginkgo.By("Waiting for all pods to complete successfully")
-	err := wait.PollImmediate(time.Second, 5*time.Minute, func() (done bool, err error) {
+	const completeTimeout = 5 * time.Minute

@davidz627 (Contributor), Jan 13, 2020

nit: not strictly necessary (used once afaik)

@misterikkit (Author, Contributor), Jan 14, 2020

I pulled it into a const since it is also used in the error context.
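
For reference, the shape after the change looks roughly like this (a sketch; waitForPodsToComplete and countFinished are hypothetical stand-ins for the real test body):

package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPodsToComplete sketches the change: the timeout lives in a named
// constant so it can appear both in the poll and in the wrapped error message.
func waitForPodsToComplete(totalPods int, countFinished func() (int, error)) error {
	const completeTimeout = 5 * time.Minute
	waitErr := wait.PollImmediate(time.Second, completeTimeout, func() (bool, error) {
		numFinished, err := countFinished()
		if err != nil {
			return false, err
		}
		return numFinished == totalPods, nil
	})
	if waitErr != nil {
		return fmt.Errorf("some pods did not complete within %v: %v", completeTimeout, waitErr)
	}
	return nil
}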

@@ -590,14 +591,15 @@ var _ = utils.SIGDescribe("PersistentVolumes-local ", func() {
numFinished++
framework.Logf("%v/%v pods finished", numFinished, totalPods)
case v1.PodFailed:
fallthrough

@davidz627 (Contributor), Jan 13, 2020

It makes sense that PodFailed is an unrecoverable error, but is PodUnknown actually unrecoverable? Should that case just continue through the polling loop instead of returning an error?

@misterikkit (Author, Contributor), Jan 14, 2020

In the existing code, PodUnknown is unrecoverable as we return an error, while PodFailed is just skipped and we get to the return nil on line 600.

Either this was intended to be a fallthrough but the original author wrote it the wrong way, or it is intentionally a no-op and we should add a comment. I'll be safe and keep the current logic.
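
To make the two readings concrete, here is how they differ in a Go switch (illustration of the semantics only, not the merged code; checkPhase is a made-up name):

package example

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// checkPhase illustrates the two readings discussed above.
func checkPhase(phase v1.PodPhase) error {
	switch phase {
	case v1.PodFailed:
		// As a no-op case, PodFailed is skipped and the caller keeps polling.
		// Adding "fallthrough" here would instead execute the PodUnknown case
		// below and return an error.
	case v1.PodUnknown:
		return fmt.Errorf("pod is in an unknown phase")
	}
	return nil
}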

-	framework.ExpectNoError(err)
-	framework.ExpectEqual(pvc.Status.Phase, v1.ClaimPending)
+	if err != nil {
+		return false, fmt.Errorf("failed to get PVC %s/%s: %v", volume.pvc.Namespace, volume.pvc.Name, err)

@davidz627 (Contributor), Jan 13, 2020

Not sure that this is a terminal error; should we just log and try again?

@misterikkit (Author, Contributor), Jan 14, 2020

The assumption in this code is that the PVCs already exist. Safe assumption, looking at lines 915-916

The previous behavior called framework.ExpectNoError which would cause the stack to unwind immediately (and without context). By returning an error here, we make it to line 943 before unwinding the stack (and adding context).
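
A sketch of the resulting control flow (illustrative only; waitForPVCsPending and getPhase are hypothetical stand-ins for the PVC Get in the real test):

package example

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPVCsPending shows the control flow: a non-nil error returned from
// the condition func aborts wait.Poll immediately, and the wrapper at the
// bottom is the single place where context is attached before the test fails.
func waitForPVCsPending(getPhase func() (v1.PersistentVolumeClaimPhase, error)) error {
	waitErr := wait.Poll(time.Second, 2*time.Minute, func() (bool, error) {
		phase, err := getPhase()
		if err != nil {
			// Returning the error (instead of asserting here with
			// framework.ExpectNoError) lets the caller below wrap it
			// with context before unwinding the stack.
			return false, fmt.Errorf("failed to get PVC: %v", err)
		}
		return phase == v1.ClaimPending, nil
	})
	if waitErr != nil {
		return fmt.Errorf("error while waiting for PVCs to be Pending: %v", waitErr)
	}
	return nil
}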

@@ -268,7 +268,8 @@ func cleanupTest(cs clientset.Interface, ns string, runningPodName, unschedulabl
return nil
}

-func waitForAllPVCsPhase(cs clientset.Interface, timeout time.Duration, pvcs []*v1.PersistentVolumeClaim) (sets.String, error) {
+// waitForAllPVCsBound waits until the given PVCs are all bound. It then returns the bound PVCs as a set.

@davidz627 (Contributor), Jan 13, 2020

nit: bound PVC [names]

@misterikkit (Author, Contributor), Jan 14, 2020

+1
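
For concreteness, a hypothetical sketch of the renamed helper returning the set of bound PVC names (only the rough signature and the sets.String return are taken from the diff above; the body and the checkBound helper are illustrative, not the real code):

package example

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForAllPVCsBound waits until the given PVCs are all bound, then returns
// the names of the bound PVCs as a set.
func waitForAllPVCsBound(timeout time.Duration, pvcs []*v1.PersistentVolumeClaim,
	checkBound func(*v1.PersistentVolumeClaim) (bool, error)) (sets.String, error) {
	bound := sets.NewString()
	waitErr := wait.Poll(5*time.Second, timeout, func() (bool, error) {
		for _, pvc := range pvcs {
			if bound.Has(pvc.Name) {
				continue
			}
			ok, err := checkBound(pvc)
			if err != nil {
				return false, err
			}
			if ok {
				bound.Insert(pvc.Name)
			}
		}
		return bound.Len() == len(pvcs), nil
	})
	if waitErr != nil {
		return bound, fmt.Errorf("error waiting for all PVCs to be bound: %v", waitErr)
	}
	return bound, nil
}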

// Cleanup the test resources before breaking
defer deleteProvisionedVolumesAndDisks(c, residualPVs)
framework.ExpectNoError(err)

@davidz627 (Contributor), Jan 13, 2020

nit: context

@misterikkit (Author, Contributor), Jan 14, 2020

I added context. Also, since ExpectNoError causes an early return, I removed the unreachable code below.

detachTimeout = 5 * time.Minute
detachPollTime = 10 * time.Second
)
-	err = wait.Poll(detachPollTime, detachTimeout, func() (bool, error) {
+	err := wait.Poll(detachPollTime, detachTimeout, func() (bool, error) {

@davidz627 (Contributor), Jan 13, 2020

nit: waitErr for style consistency - also no shadowing without a shadow of a doubt

@misterikkit (Author, Contributor), Jan 14, 2020

+1

-	if newNode != oldNode {
-		framework.Logf("The pod has been failed over from %q to %q", oldNode, newNode)
+	if currentNode != oldNode {
+		framework.Logf("The pod has been failed over from %q to %q", oldNode, currentNode)

@davidz627 (Contributor), Jan 13, 2020

Is this log necessary? It should be fairly clear that this happened from a successful exit of waitForPodToFailover.

@misterikkit (Author, Contributor), Jan 14, 2020

It didn't seem like it was hurting anyone. This level of logging probably belongs in the caller. How much do you care about cleaning this one up?

@davidz627 (Contributor), Jan 14, 2020

not enough

@@ -170,10 +168,9 @@ func waitForPodToFailover(client clientset.Interface, deployment *appsv1.Deploym

if err != nil {
if err == wait.ErrWaitTimeout {
framework.Logf("Time out after waiting for %v", timeout)
return "", fmt.Errorf("pod has not failed over after %v: %v", timeout, err)

@davidz627 (Contributor), Jan 13, 2020

For all these tests where we check err == wait.ErrWaitTimeout, why is it important to disambiguate this from the standard error case? Won't the err contain the information required to understand what kind of error it is, timeout or not?

@misterikkit (Author, Contributor), Jan 14, 2020

The difference is whether the timeout duration gets added to the log. The only benefit is making the error output more readable, which is why I left it as-is. Example output would be:

pod has not failed over after 5m: timed out waiting for the condition

or

pod did not fail over from "node-1": Pod Not Found
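
For completeness, the pattern being discussed looks roughly like this (a sketch; waitForFailover, oldNode, and podFailedOver are hypothetical stand-ins, not the actual test code):

package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForFailover sketches the disambiguation: the timeout case is wrapped
// separately so the message can carry the timeout duration.
func waitForFailover(oldNode string, timeout time.Duration, podFailedOver wait.ConditionFunc) error {
	err := wait.Poll(10*time.Second, timeout, podFailedOver)
	if err == nil {
		return nil
	}
	if err == wait.ErrWaitTimeout {
		// e.g. "pod has not failed over after 5m0s: timed out waiting for the condition"
		return fmt.Errorf("pod has not failed over after %v: %v", timeout, err)
	}
	// e.g. "pod did not fail over from \"node-1\": pod not found"
	return fmt.Errorf("pod did not fail over from %q: %v", oldNode, err)
}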

@davidz627 (Contributor), Jan 14, 2020

Got it, the change looks good then.

k8s-ci-robot removed the lgtm label Jan 14, 2020
misterikkit force-pushed the misterikkit:timout-err branch from 75d01b8 to 17eafc4 on Jan 14, 2020
@davidz627 (Contributor) commented Jan 14, 2020

Let me know when ready for review again

misterikkit force-pushed the misterikkit:timout-err branch from 17eafc4 to 99a0157 on Jan 14, 2020
@misterikkit (Contributor, Author) commented Jan 14, 2020

Commits are squashed. PTAL

/hold cancel

@davidz627 (Contributor) commented Jan 14, 2020

/lgtm
/approve

k8s-ci-robot added the lgtm label Jan 14, 2020

@k8s-ci-robot (Contributor) commented Jan 14, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidz627, misterikkit, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@davidz627 (Contributor) commented Jan 14, 2020

/retest

1 similar comment
@misterikkit (Contributor, Author) commented Jan 14, 2020

/retest

@fejta-bot commented Jan 15, 2020

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

k8s-ci-robot merged commit 14ed357 into kubernetes:master Jan 15, 2020
20 checks passed:
cla/linuxfoundation misterikkit authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-kind-ipv6-parallel Job succeeded.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-storage-slow Job succeeded.
pull-kubernetes-e2e-gce-storage-snapshot Job succeeded.
pull-kubernetes-e2e-kind Job succeeded.
pull-kubernetes-e2e-kind-ipv6 Job succeeded.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.
k8s-ci-robot added this to the v1.18 milestone Jan 15, 2020