Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

windows-tests: Add retries to Windows assertConsistentConnectivity #120254

Merged

Conversation

ionutbalutoiu
Copy link
Contributor

What type of PR is this?

/kind flake

/sig testing
/sig windows

What this PR does / why we need it:

This pull request addresses a TODO item from the test/e2e/windows/hybrid_network.go file.

The max retries count was chosen to be similar with the value from:

// waitForHTTPServers waits for all webservers to be up, on all protocols sent in the input, and then validates them using the same probe logic as the rest of the suite.
func waitForHTTPServers(k *kubeManager, model *Model) error {
const maxTries = 10
framework.Logf("waiting for HTTP servers (ports 80 and/or 81) to become ready")

In my experience by monitoring the sig-windows-networking Prowjobs, the following pieces of code will rarely fail, if we don't address the TODO item:

This is happening because in the assertConsistentConnectivity function, we have:

gomega.Consistently(ctx, connChecker, duration, pollInterval).ShouldNot(gomega.HaveOccurred())

The gomega.Consistently expects the connChecker to successfully query destination for the duration (10 secs).

However, if the destination is 8.8.8.8 or www.google.com, we can expect failures, so the retries logic is necessary in the connChecker.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

I have this patch already running since a while on sig-windows-networking Prowjobs, and I didn't see the described flakiness scenario reproduced at all.

Unfortunately, I don't have any testgrid links with the rare occasions when failures occurred without this patch.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/windows Categorizes an issue or PR as relevant to SIG Windows. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 30, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 30, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @ionutbalutoiu. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ionutbalutoiu
Copy link
Contributor Author

/cc @jsturtevant, @marosset

@jsturtevant
Copy link
Contributor

are we sure something else isn't going on? We haven't ever seen flakes on that set of tests in the release jobs:

https://testgrid.k8s.io/sig-windows-signal#capz-windows-master&include-filter-by-regex=Hybrid%20

@ionutbalutoiu
Copy link
Contributor Author

ionutbalutoiu commented Sep 1, 2023

are we sure something else isn't going on? We haven't ever seen flakes on that set of tests in the release jobs:

https://testgrid.k8s.io/sig-windows-signal#capz-windows-master&include-filter-by-regex=Hybrid%20

As far as I can remember, it is what I've described happened.

However, it was on very rare occasions that this piece of code:

gomega.Consistently(ctx, connChecker, duration, pollInterval).ShouldNot(gomega.HaveOccurred())

failed when destination is 8.8.8.8 or www.google.com.

And the failure is happening outside Kubernetes scope, due to the fact that we are reaching the wide internet, and transient networking errors can happen (very rarely, but it can happen).

Perhaps that's why the original code author left a TODO in the tests code:

connChecker := func() error {
ginkgo.By(fmt.Sprintf("checking connectivity of %s-container in %s", os, podName))
// TODO, we should be retrying this similar to what is done in DialFromNode, in the test/e2e/networking/networking.go tests
stdout, stderr, err := e2epod.ExecCommandInContainerWithFullOutput(f, podName, os+"-container", cmd...)

I spawned a testing cluster, and I'm trying to reproduce the failure using --until-it-fails=true ginkgo flag, paired with --delete-namespace-on-failure=false e2e flag.

I'm targeting these tests:
https://testgrid.k8s.io/sig-windows-signal#capz-windows-master&include-filter-by-regex=Hybrid%20

I will report back when I have some concrete tests results.

@jsturtevant
Copy link
Contributor

I will report back when I have some concrete tests results.

Thanks! I generally think it makes sense to add the additional retry since it is going external and things can fail. Since we haven't really ever seen it in our release pipelines I would like to double check we aren't masking a different issue that is causing it to be more prevalent tin the pipelines you are running.

@ionutbalutoiu
Copy link
Contributor Author

I will report back when I have some concrete tests results.

I have some results now.

So, using the ltsc2022-containerd-flannel-sdnoverlay-stable Prowjob config, I ran the tests continuously in the previous days, and they never failed (4538 attempts):

Ran 3 of 7207 Specs in 55.570 seconds
SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 7204 Skipped
PASS

All tests passed...
Will keep running them until they fail.
This was attempt #4538
No, seriously... you can probably stop now.

And, using the aks-ltsc2022-azurecni-1.27 Prowjob config, I got the following (this executed for less time):

Ran 3 of 7207 Specs in 58.091 seconds
SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 7204 Skipped
PASS

All tests passed...
Will keep running them until they fail.
This was attempt #1493
No, seriously... you can probably stop now.

I stopped both testing clusters, since I didn't get a repro.

Thanks! I generally think it makes sense to add the additional retry since it is going external and things can fail. Since we haven't really ever seen it in our release pipelines I would like to double check we aren't masking a different issue that is causing it to be more prevalent tin the pipelines you are running.

I'm a bit inclined to have retires, but only for the calls that go external. However, I don't have a strong opinion on this anymore, given that I wasn't able to easily reproduce the failure.

For example, the function assertConsistentConnectivity is used in the codebase here:

  • ginkgo.By("checking connectivity from Linux to Windows")
    assertConsistentConnectivity(ctx, f, linuxPod.ObjectMeta.Name, linuxOS, linuxCheck(windowsPod.Status.PodIP, 80))
    ginkgo.By("checking connectivity from Windows to Linux")
    assertConsistentConnectivity(ctx, f, windowsPod.ObjectMeta.Name, windowsOS, windowsCheck(linuxPod.Status.PodIP))
  • ginkgo.It("should provide Internet connection for Linux containers using DNS [Feature:Networking-DNS]", func(ctx context.Context) {
    linuxPod := createTestPod(f, linuxBusyBoxImage, linuxOS)
    ginkgo.By("creating a linux pod and waiting for it to be running")
    linuxPod = e2epod.NewPodClient(f).CreateSync(ctx, linuxPod)
    ginkgo.By("verifying pod external connectivity to the internet")
    ginkgo.By("checking connectivity to 8.8.8.8 53 (google.com) from Linux")
    assertConsistentConnectivity(ctx, f, linuxPod.ObjectMeta.Name, linuxOS, linuxCheck("8.8.8.8", 53))
    })
    ginkgo.It("should provide Internet connection for Windows containers using DNS [Feature:Networking-DNS]", func(ctx context.Context) {
    windowsPod := createTestPod(f, windowsBusyBoximage, windowsOS)
    ginkgo.By("creating a windows pod and waiting for it to be running")
    windowsPod = e2epod.NewPodClient(f).CreateSync(ctx, windowsPod)
    ginkgo.By("verifying pod external connectivity to the internet")
    ginkgo.By("checking connectivity to 8.8.8.8 53 (google.com) from Windows")
    assertConsistentConnectivity(ctx, f, windowsPod.ObjectMeta.Name, windowsOS, windowsCheck("www.google.com"))
    })
  • ginkgo.By("checking connectivity to 8.8.8.8 53 (google.com) from Linux")
    assertConsistentConnectivity(ctx, f, nginxPod.ObjectMeta.Name, "linux", linuxCheck("8.8.8.8", 53))
    ginkgo.By("checking connectivity to www.google.com from Windows")
    assertConsistentConnectivity(ctx, f, agnPod.ObjectMeta.Name, "windows", windowsCheck("www.google.com"))
    ginkgo.By("checking connectivity from Linux to Windows for the first time")
    assertConsistentConnectivity(ctx, f, nginxPod.ObjectMeta.Name, "linux", linuxCheck(agnPod.Status.PodIP, 80))
  • ginkgo.By(fmt.Sprintf("checking connectivity Pod to curl http://%s:%d", nodeIP, nodePort))
    assertConsistentConnectivity(ctx, f, testPod.ObjectMeta.Name, windowsOS, windowsCheck(fmt.Sprintf("http://%s", net.JoinHostPort(nodeIP, strconv.Itoa(nodePort)))))

We definitely don't want retries for pod <--> pod or NodePort connectivity checks, since it will hide Kubernetes internal networking flakiness.

But maybe calls to 8.8.8.8:53 or www.google.com should be retried.
What do you think?

@jsturtevant
Copy link
Contributor

But maybe calls to 8.8.8.8:53 or www.google.com should be retried.
What do you think?

if we want to add re-tries for external calls that works. Do the changes here need to be adjusted for this?

@ionutbalutoiu
Copy link
Contributor Author

But maybe calls to 8.8.8.8:53 or www.google.com should be retried.
What do you think?

if we want to add re-tries for external calls that works. Do the changes here need to be adjusted for this?

Yes, I'll have to adjust the PR with retries only for external calls. I'll reply after that it's done.

…func

Add retry logic to the `assertConsistentConnectivity` function from
the `test/e2e/windows/hybrid_network.go` file.

Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
@ionutbalutoiu ionutbalutoiu force-pushed the windows-tests/conn-check-retries branch from b5e0b0a to 8e5b959 Compare October 30, 2023 15:15
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 30, 2023
@ionutbalutoiu
Copy link
Contributor Author

But maybe calls to 8.8.8.8:53 or www.google.com should be retried.
What do you think?

if we want to add re-tries for external calls that works. Do the changes here need to be adjusted for this?

Yes, I'll have to adjust the PR with retries only for external calls. I'll reply after that it's done.

Updated the PR.

@jsturtevant please review the changes when you get the chance.

@jsturtevant
Copy link
Contributor

/lgtm
/assign @marosset @aravindhp

ionutbalutoiu added a commit to ionutbalutoiu/kubernetes that referenced this pull request Nov 10, 2023
…func

Add retry logic to the `assertConsistentConnectivity` function from
the `test/e2e/windows/hybrid_network.go` file.

Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
(cherry picked from PR kubernetes#120254)
ionutbalutoiu added a commit to e2e-win/kubernetes that referenced this pull request Nov 10, 2023
…func

Add retry logic to the `assertConsistentConnectivity` function from
the `test/e2e/windows/hybrid_network.go` file.

Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
(cherry picked from PR kubernetes#120254)
ionutbalutoiu added a commit to e2e-win/kubernetes that referenced this pull request Nov 10, 2023
…func

Add retry logic to the `assertConsistentConnectivity` function from
the `test/e2e/windows/hybrid_network.go` file.

Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
(cherry picked from PR kubernetes#120254)
@jsturtevant
Copy link
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 14, 2023
@marosset
Copy link
Contributor

/ok-to-test
/milestone v1.29

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Nov 14, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Nov 14, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 14, 2023
@ionutbalutoiu
Copy link
Contributor Author

@ionutbalutoiu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-e2e-capz-windows-master 8e5b959 link false /test pull-kubernetes-e2e-capz-windows-master
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

This error doesn't seem related to the current PR changes:

Kubernetes e2e suite: [It] [sig-api-machinery] ResourceQuota should apply changes to a resourcequota status [Conformance] 

[FAILED] client rate limiter Wait returned an error: context deadline exceeded
In [It] at: test/e2e/apimachinery/resource_quota.go:1191 @ 11/14/23 19:12:39.771

@ionutbalutoiu
Copy link
Contributor Author

/retest

@jsturtevant
Copy link
Contributor

There is a quay outage that is causing issues with pulling cert-manage during management cluster boot up. So this may not pass this round

@hailkomputer
Copy link
Member

Hello @marosset , @jsturtevant and @ionutbalutoiu , as we are in the code and test freeze, I am clearing the milestone for the v1.29. If you think this will be worked on and needed in the next release, feel free to add the milestone for the next candidate.

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.29 milestone Nov 17, 2023
@jsturtevant
Copy link
Contributor

not urgent for this release we can get it in as soon as the branches open
/milestone v1.30

@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Nov 17, 2023
@jsturtevant
Copy link
Contributor

/test pull-kubernetes-e2e-capz-windows-master

@ionutbalutoiu
Copy link
Contributor Author

/retest

ionutbalutoiu added a commit to e2e-win/kubernetes that referenced this pull request Nov 29, 2023
…func

Add retry logic to the `assertConsistentConnectivity` function from
the `test/e2e/windows/hybrid_network.go` file.

Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
(cherry picked from PR kubernetes#120254)
ionutbalutoiu added a commit to e2e-win/kubernetes that referenced this pull request Dec 5, 2023
…func

Add retry logic to the `assertConsistentConnectivity` function from
the `test/e2e/windows/hybrid_network.go` file.

Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
(cherry picked from PR kubernetes#120254)
@marosset
Copy link
Contributor

/retest
Let's see if we can get this merged

@marosset
Copy link
Contributor

/approve

@marosset
Copy link
Contributor

/check-required-labels

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aravindhp, ionutbalutoiu, marosset

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 17, 2024
@k8s-ci-robot k8s-ci-robot merged commit c1fff40 into kubernetes:master Jan 17, 2024
16 checks passed
iankingori pushed a commit to e2e-win/kubernetes that referenced this pull request Apr 4, 2024
…func

Add retry logic to the `assertConsistentConnectivity` function from
the `test/e2e/windows/hybrid_network.go` file.

Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
(cherry picked from PR kubernetes#120254)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Archived in project
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

6 participants