
e2e should retry if service is not available #118256

Merged (1 commit) on May 25, 2023

Conversation

@aojea (Member) commented May 25, 2023

/kind bug
/kind flake

Release note: NONE

Found in this job https://github.com/cilium/cilium/actions/runs/5070707417?pr=25653

2023-05-24T15:55:41.7757181Z   May 24 15:55:14.545: INFO: Failed inside E2E framework:
2023-05-24T15:55:41.7760960Z       k8s.io/kubernetes/test/e2e/framework/pod.WaitForPodsResponding({0x7f0504379858, 0xc002fa5950}, {0x72b3810?, 0xc00424a000}, {0xc0013d3670, 0xd}, {0x6b12fdc, 0x1c}, 0x0, 0xd18c2e2800, ...)
2023-05-24T15:55:41.7761492Z       	test/e2e/framework/pod/wait.go:625 +0x3ef
2023-05-24T15:55:41.7765092Z       k8s.io/kubernetes/test/e2e/framework/pod.podRunningMaybeResponding({0x7f0504379858, 0xc002fa5950}, {0x72b3810, 0xc00424a000}, {0xc0013d3670, 0xd}, {0x6b12fdc, 0x1c}, 0x0?, 0x1, ...)
2023-05-24T15:55:41.7765751Z       	test/e2e/framework/pod/resource.go:117 +0x11d
2023-05-24T15:55:41.7766578Z       k8s.io/kubernetes/test/e2e/framework/pod.VerifyPods(...)
2023-05-24T15:55:41.7767005Z       	test/e2e/framework/pod/resource.go:99
2023-05-24T15:55:41.7767519Z       k8s.io/kubernetes/test/e2e/network.glob..func27.22({0x7f0504379858, 0xc002fa5950})
2023-05-24T15:55:41.7767947Z       	test/e2e/network/service.go:1816 +0xbfb
2023-05-24T15:55:41.7768697Z   STEP: stopping RC slow-terminating-unready-pod in namespace services-7833 @ 05/24/23 15:55:14.546
2023-05-24T15:55:41.7773463Z   STEP: deleting service tolerate-unready in namespace services-7833 @ 05/24/23 15:55:15.054

The test uses VerifyPods to check the pods' state:

framework.ExpectNoError(e2epod.VerifyPods(ctx, t.Client, t.Namespace, t.Name, false, 1))

which uses:

func VerifyPods(ctx context.Context, c clientset.Interface, ns, name string, wantName bool, replicas int32) error {
	return podRunningMaybeResponding(ctx, c, ns, name, wantName, replicas, true)
}

which calls podRunningMaybeResponding with checkResponding set to true:

func podRunningMaybeResponding(ctx context.Context, c clientset.Interface, ns, name string, wantName bool, replicas int32, checkResponding bool) error {
	pods, err := PodsCreated(ctx, c, ns, name, replicas)
	if err != nil {
		return err
	}
	e := podsRunning(ctx, c, pods)
	if len(e) > 0 {
		return fmt.Errorf("failed to wait for pods running: %v", e)
	}
	if checkResponding {
		return WaitForPodsResponding(ctx, c, ns, name, wantName, podRespondingTimeout, pods)
	}
	return nil
}

And here is something I really don't understand: WaitForPodsResponding uses the apiserver proxy to connect to the pod :/. It does not seem like the best way to do it, but it has been there for a long time.

func WaitForPodsResponding(ctx context.Context, c clientset.Interface, ns string, controllerName string, wantName bool, timeout time.Duration, pods *v1.PodList) error {
	if timeout == 0 {
		timeout = podRespondingTimeout
	}
	ginkgo.By("trying to dial each unique pod")
	label := labels.SelectorFromSet(labels.Set(map[string]string{"name": controllerName}))
	options := metav1.ListOptions{LabelSelector: label.String()}

	type response struct {
		podName  string
		response string
	}

	get := func(ctx context.Context) ([]response, error) {
		currentPods, err := c.CoreV1().Pods(ns).List(ctx, options)
		if err != nil {
			return nil, fmt.Errorf("list pods: %w", err)
		}
		var responses []response
		for _, pod := range pods.Items {
			// Check that the replica list remains unchanged, otherwise we have problems.
			if !isElementOf(pod.UID, currentPods) {
				return nil, gomega.StopTrying(fmt.Sprintf("Pod with UID %s is no longer a member of the replica set. Must have been restarted for some reason.\nCurrent replica set:\n%s", pod.UID, format.Object(currentPods, 1)))
			}
			ctxUntil, cancel := context.WithTimeout(ctx, singleCallTimeout)
			defer cancel()
			body, err := c.CoreV1().RESTClient().Get().
				Namespace(ns).
				Resource("pods").
				SubResource("proxy").
				Name(string(pod.Name)).
				Do(ctxUntil).
				Raw()
			if err != nil {

But anyway, that is for another discussion. The problem is that the Eventually loop does not retry on this specific error:

	err := framework.Gomega().
		Eventually(ctx, framework.HandleRetry(get)).
		WithTimeout(timeout).
		Should(framework.MakeMatcher(match))
	if err != nil {
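For context on why a single 503 was fatal here, a minimal standalone sketch (not framework code, assuming the Gomega v1.27+ polling semantics the framework builds on): a plain error returned from the polled function is retried until the timeout, while an error wrapped with gomega.StopTrying ends the loop immediately, which is what produces the "Told to stop trying after 0.229s" failure below.

package retrysketch

import (
	"context"
	"errors"
	"testing"
	"time"

	"github.com/onsi/gomega"
)

// TestPlainErrorsAreRetried: returning a plain error from the polled
// function does not abort Eventually; it keeps polling and the assertion
// succeeds once the error goes away. Wrapping the same error with
// gomega.StopTrying("...").Wrap(err) instead ends the loop on the first
// attempt, which is how the 503 surfaced in this flake.
func TestPlainErrorsAreRetried(t *testing.T) {
	g := gomega.NewWithT(t)

	attempts := 0
	g.Eventually(context.Background(), func(ctx context.Context) (string, error) {
		attempts++
		if attempts < 3 {
			// A transient failure, e.g. a 503 from the pod proxy.
			return "", errors.New("the server is currently unable to handle the request")
		}
		return "ok", nil
	}).WithTimeout(5 * time.Second).WithPolling(10 * time.Millisecond).Should(gomega.Equal("ok"))
}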

2023-05-24T22:18:18.9343489Z   [FAILED] checking pod responses: Told to stop trying after 0.229s.
2023-05-24T22:18:18.9367553Z   Unexpected final error while getting []pod.response: Controller my-hostname-basic-ccb0a63d-f67f-4d4f-ad7f-2791ca1fe1b4: failed to Get from replica pod my-hostname-basic-ccb0a63d-f67f-4d4f-ad7f-2791ca1fe1b4-rbf2q:
2023-05-24T22:18:18.9381847Z       <*errors.StatusError | 0xc002e38960>: 
2023-05-24T22:18:18.9390719Z       the server is currently unable to handle the request (get pods my-hostname-basic-ccb0a63d-f67f-4d4f-ad7f-2791ca1fe1b4-rbf2q)
2023-05-24T22:18:18.9392497Z       {
2023-05-24T22:18:18.9400817Z           ErrStatus: 
2023-05-24T22:18:18.9410357Z               code: 503
2023-05-24T22:18:18.9426937Z               details:
2023-05-24T22:18:18.9442382Z                 causes:
2023-05-24T22:18:18.9446422Z                 - message: unknown
2023-05-24T22:18:18.9451842Z                   reason: UnexpectedServerResponse
2023-05-24T22:18:18.9457005Z                 kind: pods
2023-05-24T22:18:18.9461651Z                 name: my-hostname-basic-ccb0a63d-f67f-4d4f-ad7f-2791ca1fe1b4-rbf2q
2023-05-24T22:18:18.9463808Z               message: the server is currently unable to handle the request (get pods my-hostname-basic-ccb0a63d-f67f-4d4f-ad7f-2791ca1fe1b4-rbf2q)
2023-05-24T22:18:18.9468094Z               metadata: {}
2023-05-24T22:18:18.9468480Z               reason: ServiceUnavailable
2023-05-24T22:18:18.9474577Z               status: Failure,
2023-05-24T22:18:18.9476823Z       }
2023-05-24T22:18:18.9477229Z   Pod status:

It seems fair game to retry on IsServiceUnavailable, since we are just waiting for the proxy-to-pod service to become available.
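A minimal sketch of the kind of retry predicate this implies; the function name and the exact list of transient errors are illustrative here, not necessarily what this PR touches:

package framework

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	netutils "k8s.io/apimachinery/pkg/util/net"
)

// shouldRetry reports whether an apiserver error is expected to go away
// on its own, so the surrounding Eventually loop should keep polling
// instead of giving up.
func shouldRetry(err error) bool {
	// Transient connection problems and server-side hiccups.
	if netutils.IsConnectionReset(err) || apierrors.IsInternalError(err) || apierrors.IsTooManyRequests(err) {
		return true
	}
	// A 503 ServiceUnavailable from the pod proxy just means the proxied
	// service is not reachable yet; keep retrying until the timeout.
	return apierrors.IsServiceUnavailable(err)
}

With something like this wired into framework.HandleRetry, the 503 would be polled through instead of being turned into a StopTrying failure.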

The e2e framework uses active loops to wait for certain async operations; these loops need to retry on some errors and fail on others.

For functions that depend on some operation having happened, the apiserver may return 503 errors until that specific service is available, so we should retry on those too.

Change-Id: Ib3d194184f6385b9d3d151c7055f27c97c21c3ff
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 25, 2023
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label May 25, 2023
@aojea changed the title from "e2e should retry if Service is not" to "e2e should retry if service is not available" on May 25, 2023
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 25, 2023
@k8s-ci-robot k8s-ci-robot added area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 25, 2023
@aojea (Member, Author) commented May 25, 2023

/assign @pohly
/cc @joestringer

@k8s-ci-robot (Contributor)

@aojea: GitHub didn't allow me to request PR reviews from the following users: joestringer.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/assign @pohly
/cc @joestringer

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pohly (Contributor) left a comment


This seems safe. The worst that can happen is that a test which encounters such an error when it is permanent keeps retrying until it times out, and then fails with a report that mentions the error.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 25, 2023
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 621493692b8aac7d84133fe9dea40b3ecd198693

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added a commit that referenced this pull request Jun 6, 2023
…6-upstream-release-1.27

Automated cherry pick of #118256: e2e framework retry on Service unavailable errors