windows-tests: Add retries to Windows assertConsistentConnectivity
…
#120254
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Hi @ionutbalutoiu. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cc @jsturtevant, @marosset |
Are we sure something else isn't going on? We haven't ever seen flakes on that set of tests in the release jobs: https://testgrid.k8s.io/sig-windows-signal#capz-windows-master&include-filter-by-regex=Hybrid%20 |
As far as I can remember, what I described is what happened. However, it was only on very rare occasions that this piece of code: kubernetes/test/e2e/windows/hybrid_network.go Line 122 in 370c85f
failed when destination is And the failure is happening outside Kubernetes scope, due to the fact that we are reaching the wide internet, and transient networking errors can happen (very rarely, but it can happen). Perhaps that's why the original code author left a kubernetes/test/e2e/windows/hybrid_network.go Lines 112 to 115 in b2499a1
I spawned a testing cluster, and I'm trying to reproduce the failure using I'm targeting these tests: I will report back when I have some concrete tests results. |
Thanks! I generally think it makes sense to add the additional retry since it is going external and things can fail. Since we haven't really ever seen it in our release pipelines, I would like to double-check that we aren't masking a different issue that is causing it to be more prevalent in the pipelines you are running. |
I have some results now. So, using the ltsc2022-containerd-flannel-sdnoverlay-stable Prowjob config, I ran the tests continuously in the previous days, and they never failed (4538 attempts):
And, using the aks-ltsc2022-azurecni-1.27 Prowjob config, I got the following (this executed for less time):
I stopped both testing clusters, since I didn't get a repro.
I'm a bit inclined to have retries, but only for the calls that go external. However, I don't have a strong opinion on this anymore, given that I wasn't able to easily reproduce the failure. For example, the function
We definitely don't want retries for But maybe calls to |
If we want to add retries for external calls, that works. Do the changes here need to be adjusted for this? |
Yes, I'll have to adjust the PR to add retries only for external calls. I'll reply once that's done. |
…func Add retry logic to the `assertConsistentConnectivity` function from the `test/e2e/windows/hybrid_network.go` file. Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
b5e0b0a to 8e5b959
Updated the PR. @jsturtevant please review the changes when you get the chance. |
/lgtm |
…func Add retry logic to the `assertConsistentConnectivity` function from the `test/e2e/windows/hybrid_network.go` file. Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com> (cherry picked from PR kubernetes#120254)
/hold cancel |
/ok-to-test |
This error doesn't seem related to the current PR changes:
|
/retest |
There is a quay outage that is causing issues with pulling cert-manager during management cluster boot up, so this may not pass this round. |
Hello @marosset , @jsturtevant and @ionutbalutoiu , as we are in the code and test freeze, I am clearing the milestone for the v1.29. If you think this will be worked on and needed in the next release, feel free to add the milestone for the next candidate. /milestone clear |
Not urgent for this release; we can get it in as soon as the branches open. |
/test pull-kubernetes-e2e-capz-windows-master |
/retest |
/retest |
/approve |
/check-required-labels |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aravindhp, ionutbalutoiu, marosset The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind flake
/sig testing
/sig windows
What this PR does / why we need it:
This pull request addresses a `TODO` item from the `test/e2e/windows/hybrid_network.go` file.
The max retries count was chosen to be similar to the value from:
kubernetes/test/e2e/network/netpol/test_helper.go
Lines 68 to 71 in 1144c85
In my experience monitoring the sig-windows-networking Prowjobs, the following pieces of code fail on rare occasions if we don't address the `TODO` item:
kubernetes/test/e2e/windows/hybrid_network.go
Line 87 in 370c85f
kubernetes/test/e2e/windows/hybrid_network.go
Line 98 in 370c85f
This is happening because in the `assertConsistentConnectivity` function, we have:
kubernetes/test/e2e/windows/hybrid_network.go
Line 122 in 370c85f
The `gomega.Consistently` call expects the `connChecker` to successfully query the destination for the whole `duration` (10 secs).
However, if the destination is `8.8.8.8` or `www.google.com`, we can expect occasional failures, so the retry logic is necessary in the `connChecker`.
Which issue(s) this PR fixes:
N/A
Special notes for your reviewer:
I have had this patch running for a while on the sig-windows-networking Prowjobs, and I haven't seen the described flakiness scenario reproduced at all.
Unfortunately, I don't have any testgrid links for the rare occasions when failures occurred without this patch.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: