Flaky test: [sig-network] Services should be able to preserve UDP traffic when server pod cycles for a NodePort service #91236
/triage unresolved 🤖 I am a bot run by vllry. 👩🔬 |
CRI-O CI is also hitting this failure. We have been unable to reproduce it locally; our CI runs on AWS. |
hmm, I only see 1 failure out of 4, and it seems that the connectivity didn't work at all 🤔
|
@thockin If you aren't able to handle this issue, consider unassigning yourself and/or adding the … 🤖 I am a bot run by vllry. 👩🔬 |
/assign @jayunit100 @JacobTanenbaum |
@robscott: GitHub didn't allow me to assign the following users: JacobTanenbaum. Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cc Saw the test flaking recently. Wondering if this is just a timeout issue; maybe we can increase the timeout and/or switch to a TCP-based test:
|
the test should use UDP, and UDP is unreliable by definition ... however it is curious that most of the failures happen because the client is not able to communicate with the first pod: https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=continuous%20echo%20was%20not%20able%20to%20communicate%20with%20initial%20server%20pod#e76fa297acbadf09dac1
I think that is something that should be investigated. I've analyzed some occurrences and it works fine for the second pod; however, UDP losses should be totally random 🤔 |
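(Not from the thread, just to make the "losses should be random" point concrete: a minimal probe sketch, where the node IP, the NodePort, and the agnhost `hostname` command are all assumptions rather than the e2e test's actual code.)

```bash
# Sketch: count which backend answers each UDP probe. NODE_IP/NODE_PORT are
# placeholders; assumes the server is agnhost netexec, whose UDP handler
# replies to "hostname" with the serving pod's name.
NODE_IP=10.0.0.1
NODE_PORT=30080
for i in $(seq 1 60); do
  # -u: UDP, -w 1: give up after one second; a few empty replies are plain loss
  echo "hostname" | nc -u -w 1 "$NODE_IP" "$NODE_PORT"
  sleep 1
done | sort | uniq -c
# Random loss: both pod names appear with similar counts.
# The flake pattern here: the first pod's name never appears at all.
```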
Hi everyone, yeah so... collecting some data on this...
So I guess the best reproducer is to run these tests, for now, in the cloud... Now that I have a reproducer for this, maybe we can live debug it some time next week with one of y'all :) |
bear in mind that this test is connecting to a NodePort on the node IP. The test is functionally correct; it may be flaky due to the unreliable nature of UDP or due to pod scheduling, but if it is failing constantly it has to be environmental. |
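(To spell out what "a NodePort on the node IP" means concretely, a quick sketch with a hypothetical service name:)

```bash
# Hypothetical service name "udp-svc"; these just resolve the two values the
# test dials: the allocated NodePort and a node's primary address.
kubectl get svc udp-svc -o jsonpath='{.spec.ports[0].nodePort}'
kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}'
```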
Second failure after two weeks of green 5k runs: |
/reopen The difference now is that we are testing NodePort and ClusterIP
|
@aojea: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
We're hitting this really frequently in 5k-node tests, so I started looking a little bit into it.
That explains some delays, but it doesn't explain e.g. this run: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-proto/1273659998622191616 where iptables were programmed at …
So I claim that the timeouts in this test are visibly too low, but I suspect that's not everything. I will send out a first PR that tries to improve it in a couple of minutes. |
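(A sketch of the "timeouts too low" point: before treating failed probes as a product bug, wait generously for kube-proxy to actually program the rule. `KUBE-NODEPORTS` is kube-proxy's iptables-mode chain; the port and the 120s deadline are assumptions.)

```bash
NODE_PORT=30080                 # placeholder
deadline=$((SECONDS + 120))     # generous deadline instead of a tight one
# Poll until kube-proxy has programmed a rule mentioning the NodePort; only
# after that should failed probes count against the test.
until iptables -t nat -S KUBE-NODEPORTS 2>/dev/null | grep -qw "$NODE_PORT"; do
  if [ "$SECONDS" -ge "$deadline" ]; then
    echo "NodePort rule never programmed" >&2
    exit 1
  fi
  sleep 2
done
echo "rule present after ~${SECONDS}s"
```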
It seems that the packet never gets DNATed on the receiving host, so it does not reach the pod on the node. |
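(One way to check that on the receiving node; a sketch with a placeholder port, using `conntrack` from conntrack-tools:)

```bash
NODE_PORT=30080   # placeholder
# Is there a DNAT rule for this NodePort in the nat table at all?
iptables -t nat -S | grep -w "$NODE_PORT"
# Does conntrack show the UDP flow rewritten to a pod IP? If the entry's
# reply side still carries the node IP instead of a pod IP, DNAT never happened.
conntrack -L -p udp --dport "$NODE_PORT"
```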
/triage accepted |
Yesterday's flake - https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce/1318719364404350976 I don't think I'm seeing the log messages from the latest logging PR you made, @aojea? Or are the prow logs not logging at a high enough level? |
Logs are there, please bear with me:
click on the job failure and you'll see a larger output where you can see the server pod name:
and get the node where the pod is running:
and the client pod name and its node:
and the flow we are looking for:
with the node name we can get the kube-proxy logs there, clicking on "Artifacts": and you can see the "new" conntrack log entries
and in the client: wow, there is a kernel bug there @wojtek-t @BenTheElder, can it be related? 🤔
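(For context on why those conntrack entries matter, a sketch, not the exact kube-proxy invocation: a stale UDP conntrack entry created against the old pod keeps steering datagrams to a dead backend until it is flushed.)

```bash
NODE_PORT=30080   # placeholder
# An entry still pointing at the deleted pod's IP means new datagrams are
# being steered to a dead backend.
conntrack -L -p udp --dport "$NODE_PORT"
# Delete those entries so the next datagram gets re-DNATed to the new
# endpoint; kube-proxy performs an equivalent cleanup when UDP endpoints change.
conntrack -D -p udp --dport "$NODE_PORT"
```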
|
wohoo - some progress :) |
This kernel panic may be a coincidence; maybe this is the reason: #96174 |
Spotted another flake today: |
Another flake in the 5k job: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1351862443671818240. |
Which jobs are flaking:
ci-kubernetes-e2e-gce-scale-correctness
Which test(s) are flaking:
[sig-network] Services should be able to preserve UDP traffic when server pod cycles for a NodePort service
Testgrid link:
https://k8s-testgrid.appspot.com/sig-release-master-informing#gce-master-scale-correctness
Reason for failure:
/sig network
/assign @thockin