Flake: periodic-kubernetes-containerd-conformance-test-ppc64le, e2e conformance test in /e2e/network/service.go related to session affinity #112442
Comments
@Rajalakshmi-Girish: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@iXinqi any idea on this failure? |
@mkumatag ^^ |
could it be an under-resourced environment? |
We are running the conformance suite on a multi-node cluster (one master and two workers), with 32 GB of memory and 8 vCPUs for each node.
@aojea do you mean the bug #112412? |
Those tests are completely green on the other jobs, and they used to be affected by environmental constraint problems; that is why I wrongly assumed it was the same environment. This job installs external components like Calico, these tests only seem to fail in this job, and ppc64 is not officially supported... I can't help much here, but happy to answer questions (if I can help). |
Here is the exact error for this failure:
Sep 19 10:49:57.861: INFO: Running '/home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/_rundir/ab4a552e-3803-11ed-a942-f2778d4b64c2/kubectl --kubeconfig=/home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/config1-1663582512/kubeconfig --namespace=services-4493 exec execpod-affinity29fkj -- /bin/sh -x -c echo hostName | nc -v -t -w 2 192.168.160.102 32615'
Sep 19 10:50:02.468: INFO: rc: 1
Sep 19 10:50:02.468: INFO: Service reachability failing with error: error running /home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/_rundir/ab4a552e-3803-11ed-a942-f2778d4b64c2/kubectl --kubeconfig=/home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/config1-1663582512/kubeconfig --namespace=services-4493 exec execpod-affinity29fkj -- /bin/sh -x -c echo hostName | nc -v -t -w 2 192.168.160.102 32615:
Command stdout:
stderr:
+ + echo hostName
nc -v -t -w 2 192.168.160.102 32615
nc: connect to 192.168.160.102 port 32615 (tcp) timed out: Operation in progress
command terminated with exit code 1
error:
exit status 1
Seems like it is timing out while reaching the NodePort service running on |
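For reference, the same probe the test runs can be replayed by hand (a sketch; the namespace, pod name, IP, and port are taken from the failing run above and will differ in a fresh run):

# Replay the exact check the e2e test performs:
kubectl --namespace=services-4493 exec execpod-affinity29fkj -- \
  /bin/sh -c 'echo hostName | nc -v -t -w 2 192.168.160.102 32615'
# On success nc prints a connection message and exits 0;
# "Operation in progress" means the TCP handshake timed out.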
yeah, the underlay network or a constrained environment used to be the usual suspects, but those things tend to hit randomly; if it is consistent... there was a bug in IBM platforms and OpenShift CI related to security groups dropping packets... just throwing out ideas |
It's not consistent, that's what's driving me crazy... a security group is not involved here; it's all in the same private network. I guess something is wrong with inter-pod connectivity; will debug this further. |
those are the most difficult ones... I can't see component logs on the job; if you have kube-proxy logs, that may give us a hint |
unfortunately we haven't implemented that yet |
I suggest discarding the obvious first; last time you had a problem, it was solved by updating Calico #106264 (comment) |
good point, @Rajalakshmi-Girish ^^ |
kube-proxy-logs.txt @aojea any clues from these logs, please? |
nothing suspicious in the logs; if it is an intermittent network problem, you have to reproduce it and trace the whole path from origin to destination to check where it is failing. It seems all tests involve |
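One way to trace that path is to capture traffic on the target node while replaying the probe (a sketch; the port comes from the log above, and the capture file path is arbitrary):

# On the node that owns 192.168.160.102, capture traffic for the NodePort:
tcpdump -i any -nn 'tcp port 32615' -w /tmp/nodeport.pcap
# Replay the probe from the exec pod in another terminal, then inspect:
tcpdump -nn -r /tmp/nodeport.pcap
# A SYN with no SYN-ACK points at the node's dataplane (iptables/Calico);
# no SYN arriving at all points at the underlay network.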
@aojea I tried running this suite on the latest version of Calico (v3.24.0), but it is flaking there too! |
@Rajalakshmi-Girish can you try to run with this patch? |
@aojea It is failing even with this patch.
ginkgo --nodes=4 --until-it-fails e2e.test -- --kubeconfig= --ginkgo.focus="Services\ should\ have\ session\ affinity" --report-dir=/root/artifacts |
What does the error output look like with that patch? |
@aojea Please find the complete output attached |
/remove-sig testing SIG Testing owns the Test Framework, CI infrastructure, test tools, etc. We are not responsible for individual test cases or external CI. /sig network |
/assign @aojea now that I found a problem, I'm curious whether we have something wrong in the tests |
It seems most tests are skipped because there are no nodes available; conformance requires a minimum of 2 nodes |
My bad! I assumed they ran well as the cluster had 2 workers and one master. |
no worries, you can filter the SCTP tests, those are not going to work based on the logs |
Ok I shall run SCTP tests and update here. |
@aojea yes, 10 out of 16 tests filtered with the SCTP regex failed. Please find the attached log. |
heh, sorry, I meant just the opposite, do not run SCTP tests 🙃 focus="[sig-network]\ Networking\ Granular\ Checks:" |
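Putting that together with the earlier invocation, a sketch of the command (the kubeconfig path is a placeholder, the brackets are escaped for regex matching, and a skip keeps the SCTP cases out):

ginkgo --nodes=4 --until-it-fails e2e.test -- \
  --kubeconfig=/path/to/kubeconfig \
  --ginkgo.focus="\[sig-network\]\ Networking\ Granular\ Checks:" \
  --ginkgo.skip="SCTP" \
  --report-dir=/root/artifacts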
Granular_network_serial.txt |
@aojea Did you get a chance to look at the run results? These tests are still flaking in our jobs. |
no time, sorry; this kind of problem is usually solved when you can reproduce it on demand, so you can check the whole network path and all the components |
is this still alive? |
yes, and one of our network people is debugging it. |
@mkumatag @Rajalakshmi-Girish can you paste the kube-proxy config? |
I'm mostly interested in knowing the value of |
I have fetched the config details from the configmap by name
@aojea is this the information you were looking for? |
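For anyone following along, on kubeadm-provisioned clusters the kube-proxy configuration typically lives in a ConfigMap; the names assumed below are the kubeadm defaults and may not match this cluster:

# Dump the kube-proxy configuration (kubeadm default names assumed):
kubectl -n kube-system get configmap kube-proxy -o yaml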
If you add verbosity level 2 to kube-proxy, you should be able to find the exact value in the logs, to rule out that flags are overriding the value |
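A minimal sketch of that, assuming kube-proxy runs as the kubeadm-default DaemonSet in kube-system:

# Append --v=2 to the kube-proxy container command, then restart the pods:
kubectl -n kube-system edit daemonset kube-proxy
kubectl -n kube-system rollout restart daemonset kube-proxy
# Grep the logs for the effective configuration values:
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=500 | grep -i affinity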
@aojea |
Nah, not the same thing :/ |
I happened to create a new cluster today with the current
Though the test
As the lines https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L969:#L970 consider only two nodes for testing, the test PASSES when the issue is only with one node. (Sometimes the
Thus, I would like to reiterate that these tests have been flaking in our environment for more than a couple of months :| |
can you elaborate on that? How did the exec fail without the test picking it up? I can see that it returns an error if it fails: kubernetes/test/e2e/framework/service/jig.go Lines 867 to 870 in 3e26e10
this is a downstream environment; there is not much the community can do beyond helping with suggestions without access to the environment, and there are a lot of external factors that can influence the problem. You mentioned there was a network person on your side debugging it; what are their findings? Maybe that can give us some leads |
As mentioned in https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L969:#L970, the function
The debugging is in progress. Will share here once there are any findings. |
sorry, this is the part I'm not getting: is the test running that |
Yes, the test runs the nc commands at https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L905:#L908. When the func
Thus, if there is an nc failure using the third node, the test would still PASS. |
but is this something you are seeing or a hypothesis? |
We are seeing this on the clusters created with
Though the test run against such an environment PASSES, it still has
If the test happens to choose the node IP that has a failure, the test FAILs. |
we can't test each NodePort of a cluster because that doesn't scale; these tests run in clusters with 5k, 10k, and 15k nodes. There are some trade-offs, and this approach has been working well for a long time. We can always revisit it if we detect a considerable rate of false positives, but if that is happening in your environment then you have to investigate why: dump the iptables rules on all nodes, see if they differ, and determine whether there is a bug in the code or an environmental problem with that specific node. |
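A sketch of that sweep, probing the NodePort on every node instead of the two the test samples (the namespace, pod, and port are placeholders taken from the failing run above; adapt them to the live cluster):

# Probe the NodePort on every node's InternalIP:
NODE_PORT=32615   # placeholder: take the real port from 'kubectl get svc'
for ip in $(kubectl get nodes \
    -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  echo "--- $ip ---"
  kubectl --namespace=services-4493 exec execpod-affinity29fkj -- \
    /bin/sh -c "echo hostName | nc -v -t -w 2 $ip $NODE_PORT" || echo "FAILED on $ip"
done
# On a failing node, dump the NAT rules and diff them against a healthy node:
iptables-save -t nat | grep 32615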
Should we close this, then? |
Is https://issues.redhat.com/browse/OCPBUGS-4503 related to this? |
some of the logs say you are using Calico; I didn't check recent runs, but if you are using ovn-kubernetes, then it seems so |
Which jobs are flaking?
https://prow.k8s.io/job-history/gs/ppc64le-kubernetes/logs/periodic-kubernetes-containerd-conformance-test-ppc64le
Which tests are flaking?
Since when has it been flaking?
probably for a fortnight.
Testgrid link
https://k8s-testgrid.appspot.com/ibm-conformance-ppc64le#Periodic%20Conformance%20on%20ppc64le%20with%20containerd%20as%20runtime
Reason for failure (if possible)
No response
Anything else we need to know?
Below is the failure:
Relevant SIG(s)
/sig testing