[Flaking Test] [sig-network] Networking Granular Checks: Services should update endpoints: http (gce-ubuntu-master-containerd,master-blocking) #123760
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @aojea
/retitle [Flaking Test] [sig-network] Networking Granular Checks: Services should update endpoints: http (gce-ubuntu-master-containerd,master-blocking)
It flaked in ci-kubernetes-e2e-gci-gce-alpha-enabled-default before.
kubernetes/test/e2e/network/networking.go Lines 328 to 343 in 227c2e7
Other one here: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/124161/pull-kubernetes-e2e-gci-gce-ipvs/1775434062043811840 This looks worrisome, the fact that we get a different endpoint. I was checking the kube-proxy logs and I don't see anything suspicious. It also fails with ipvs, which may indicate the problem is not in the implementation ... could it be that the terminated pod answers with the host hostname during termination? /priority important-soon
/priority important-soon
BTW, @aojea, is this a release-cut blocker (the next cut is rc.2 this Thursday)? If not, should this target v1.31?
I'm trying to understand what is going on with this; it is really weird. It seems to only affect GCE jobs, but still with a high rate of failures.
Analyzing two occurrences, we can see that the problem happens on the pod that is being removed: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1776408965454761984
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1776347560789676032
Talked with @danwinship offline about this; there are two possible theories:
kubernetes/test/e2e/network/networking.go Lines 328 to 343 in 9791f0d
These are the kube-proxy logs at the time things happen:
Hmm, the Pod is deleted with NewDeleteOptions{0}: kubernetes/test/e2e/framework/network/utils.go Lines 886 to 901 in be4b717
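For context, NewDeleteOptions{0} means the delete request carries a zero grace period: the pod object goes away immediately, while container and network teardown on the node may still be in flight. A minimal sketch of such a force delete with client-go (the namespace, pod name, and kubeconfig handling are placeholders, not the framework's actual code):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (placeholder; the e2e framework
	// wires up its own client).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// NewDeleteOptions(0) sets GracePeriodSeconds to 0: the endpoint disappears
	// from the API right away, while the old container may still be tearing down
	// on the node.
	if err := client.CoreV1().Pods("nettest").Delete(
		context.TODO(), "netserver-0", *metav1.NewDeleteOptions(0)); err != nil {
		panic(err)
	}
}
```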
Still going with https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1776347560789676032: the first query after deletion happens at 20:52:29.935143
kube-proxy does not update the rules until 20:52:31.591636
so it is possible the first request lands on the pod being deleted. The question is: can the process get a different hostname? Containerd logs to check if something happens at 20:52:29.935143: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1776347560789676032/artifacts/bootstrap-e2e-minion-group-n25x/containerd.log, cc: @smarterclayton
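On the "can the process get a different hostname" question: the hostname an endpoint reports is just what os.Hostname() returns in the responding process's UTS namespace, so a responder sharing the host's namespaces (a hostNetwork pod, or a process on the node) reports the node name instead of the pod name. A tiny illustration, assuming the netserver's hostname reply is backed by os.Hostname() as in agnhost-style servers (an assumption here, not verified against the test image):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// os.Hostname() reads the hostname of the current UTS namespace.
	// In a regular pod this is the pod name (e.g. netserver-0); for a
	// hostNetwork pod or a node-level process it is the node's hostname
	// (e.g. bootstrap-e2e-minion-group-5dtl, as seen in the failing runs).
	name, err := os.Hostname()
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```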
Hmmm @pacoxu, this only happens with master containerd,
and it stopped happening in the last 4 days, maybe related to containerd/containerd#9999. Testgrid does not show any recent failure in the last 5 days either: https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd&include-filter-by-regex=should%20update%20endpoints&width=20 I'm going to close as:
/close Please reopen if it happens again. IMPORTANT: the failure has to have the hostname in the map.
@aojea: Closing this issue. In response to this:
@aojea: Reopened this issue. In response to this:
/assign @aojea This is really weird; maybe there is something more worrisome behind it.
@aojea Is there any reason why we use kubernetes/test/e2e/network/networking.go Line 339 in ba05a8d
After deleting the pod/endpoint the condition is met within 8 seconds, but we wait for 3 minutes and keep on polling hostnames. If the test fails due to interference or any race condition, the chances of that are increased by polling for the full duration. Why the node hostname is retrieved is a different story; I'm still looking into it.
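To illustrate the polling concern (a sketch under assumptions, not the test's actual code; checkEndpoints, the interval, and the timeout are hypothetical): a condition-style poll with wait.PollImmediate returns as soon as the expected hostname set is observed, instead of continuing to dial for the whole window and giving a stray responder extra chances to pollute the result:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkEndpoints is a hypothetical stand-in for one round of the
// "dial the service and collect hostnames" step.
func checkEndpoints() (map[string]struct{}, error) {
	// ... curl the /dial endpoint and parse the responses ...
	return map[string]struct{}{"netserver-1": {}, "netserver-2": {}, "netserver-3": {}}, nil
}

func main() {
	expected := map[string]struct{}{"netserver-1": {}, "netserver-2": {}, "netserver-3": {}}

	// Stop as soon as the observed set matches, rather than polling for the
	// full 3 minutes after the condition is already met.
	err := wait.PollImmediate(2*time.Second, 3*time.Minute, func() (bool, error) {
		got, err := checkEndpoints()
		if err != nil {
			return false, nil // retry transient failures
		}
		if len(got) != len(expected) {
			return false, nil
		}
		for name := range expected {
			if _, ok := got[name]; !ok {
				return false, nil
			}
		}
		return true, nil
	})
	fmt.Println("poll result:", err)
}
```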
This job uses stable containerd v1.7.15
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1785325927534694400 The unexpected host hostname is received in the first attempt to get the hostnames; the first attempt happens immediately after deleting a pod.
@aojea containerd logs: the container was removed at 02:25:32.903349
kube-proxy logs: the endpointslice cache was updated at 02:24:43.497430 and the proxy rules were updated at 02:24:43.771588
e2e logs: the curl request which received the unexpected hostname was made at 02:24:41.144381, roughly 2.5 seconds before the proxy rules were updated.
netserver-0 pod logs
The container and container network were removed immediately. I think there is some other pod/service that replies with the host hostname, resulting in the failure.
OK, things that we know: the request is forwarded from kube-proxy to the same endpoint (the same PodIP netserver-0 has). The process that replies to the request is listening on the PodIP of netserver-0. In the same network namespace, or was a new namespace created with a veth with that PodIP? I checked the containerd logs and I didn't see that the IP got reallocated or that another pod was created during that time frame. That PodIP is released when the Pod is removed, so (we need to verify this in the containerd logs) that IP should not be allocated to another Pod.
where do we know that from?
but earlier:
How exactly do pod IPs work on GCE? Is it possible that immediately after a pod gets deleted, traffic to that pod's IP could accidentally get delivered somewhere else? In most circumstances that might not get noticed if the pods aren't listening on the same ports, but all of the (What other tests were running concurrently with the failed test?)
Since the returned hostname is from the node, my bet is on a server running with
The service IP shouldn't exist on any interface if we aren't using ipvs, but it's more plausible that the pod IP could. Maybe when the pod netns is destroyed, its veth ends up in the host netns briefly before being destroyed? (Though that's explicitly not supposed to happen for virtual interface types.)
That was during the run of "[It] should function for service endpoints using hostNetwork", which uses
The request goes to the ClusterIP, and the endpoints are the same as they didn't change, hence the request has to be sent to the same PodIP, no? Daman's example is with IPVS; let's see one with iptables: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1786906482093068288 Bad request at 00:05:29.9824
The request is done from
The endpoints for the service are programmed at
and updated with the endpoint removal at 00:05:30
The IP removed was 10.64.3.174, so netserver-0 should have that IP; we need to validate that in the kubelet or containerd logs. Containerd logs here. The node where the deleted Pod was running:
10.64.3.174 is assigned two times, but the second time is at
sandbox 51a122a6746dff47eb22db55e355a468e89118d22e280cb87f1606f738a18370 is
The pod sandbox is removed at 00:05:22.031109
that seems to match the pod logs
However, there are still more events in containerd removing the mentioned sandbox, which sounds weird,
and the container is not completely removed until later; it may be that CNI DEL didn't work and the Pod is still alive until then. The kubelet considers the pod killed at 00:05:23.218742
The last message from that pod uid
The way GCE jobs set up the network is via the containerd CNI template
that uses the ptp plugin: kubernetes/cluster/gce/gci/configure-helper.sh Lines 3226 to 3252 in ade0d21
installing a route to the podIP through the associated veth; example from another environment:
@danwinship's theory is also possible; the hostNetwork test runs at that time.
EDIT: or both things are happening: the containerd network teardown is not happening at that time, so the PodIP is still available somehow and the hostNetwork pod is taking that request.
Should we allow overriding the listen address in agnhost via environment variables?
This would probably verify the latter.
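A rough sketch of that idea (illustrative only; LISTEN_ADDR, the default, and the port are assumptions, not agnhost's real flags or environment): bind the test server to the pod IP when an override is provided, instead of all addresses, so a node-level process could not answer on the same address and port:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Hypothetical override: e.g. the pod IP injected via the downward API.
	addr := os.Getenv("LISTEN_ADDR")
	if addr == "" {
		addr = "0.0.0.0" // default: listen on all addresses, as today
	}

	http.HandleFunc("/hostname", func(w http.ResponseWriter, r *http.Request) {
		name, _ := os.Hostname()
		fmt.Fprint(w, name)
	})

	// Port 8083 is a placeholder for this sketch.
	if err := http.ListenAndServe(addr+":8083", nil); err != nil {
		panic(err)
	}
}
```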
agnhost can already do whatever you want; the problem is that this set of tests always just passes the same ports. We could tweak things to confirm that it makes the flake go away, but that doesn't actually fix the bug; something is going wrong that makes the flake even possible.
Intermittent logging to understand root cause of kubernetes#123760 Signed-off-by: Daman Arora <aroradaman@gmail.com>
Reinforcing Dan's words: the goal of the tests is not just to pass and be green, it is to find issues so we can solve them ...
Ok, we got a hit https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1788466036081692672 Node for the Pod
Service and endpoints
No trace of that podIP on the dumps
we'll have to dump the iptables rules too
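A sketch of the kind of on-failure dump being discussed (an assumption about the approach, not the actual debug PR; in the e2e framework these commands would run on the node via SSH/hostexec, and both need root plus conntrack-tools installed):

```go
package main

import (
	"fmt"
	"os/exec"
)

// dump runs a diagnostic command and returns its combined output so it can
// be attached to the test logs when the dial check fails.
func dump(name string, args ...string) string {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Sprintf("%s %v failed: %v\n%s", name, args, err, out)
	}
	return string(out)
}

func main() {
	fmt.Println(dump("iptables-save", "-c")) // rules with packet/byte counters
	fmt.Println(dump("conntrack", "-L"))     // current connection-tracking entries
}
```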
I don't see many failures lately 🤔
A heisenbug?
Yes, this is a case where I'm desperately waiting for a test to fail/flake :p
@aojea https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1794418456712450048 failed today, but the iptables and conntrack flows were not dumped 😔 kubernetes/test/e2e/framework/network/utils.go Lines 345 to 366 in 4bb4345
@aojea this test failed again recently, on 30-05-2024: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1796412325301850112 It seems like a failure in dialing a specific HTTP endpoint and not finding the expected responses. Could you have a look at it?
The error is misleading; if we scroll up in the logs we see the failure is because it cannot find the containerd socket.
I think the containerd socket in the log may refer to a different issue: #125228, which also failed at the same time today, @aojea. Or they may relate to each other. If you take a look at the testgrid here: https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd, there are two failing tests, and one of them is this one:
@wendy-ha18 Both tests are failing for the same reason. Testgrid just shows a snippet of the stderr, so it can mislead sometimes :)
How is this doing? I can see this test has been green for the last week https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd&include-filter-by-regex=Granular%20Checks and no flakes https://storage.googleapis.com/k8s-triage/index.html?pr=1&job=master&test=Services%20should%20update%20endpoints I still think this was some issue with containerd, as this job builds containerd from the master branch. @aroradaman can we remove all the debugging things we added? /close We can reopen if it happens again.
@aojea: Closing this issue. In response to this:
@aojea sure.
Revert debug steps and logs for #123760
Which jobs are flaking?
gce-ubuntu-master-containerd
Prow: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1765185224909524992
Which tests are flaking?
Kubernetes e2e suite.[It] [sig-network] Networking Granular Checks: Services should update endpoints: http
Since when has it been flaking?
03-06 (intermittently)
Testgrid link
https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd
Reason for failure (if possible)
{ failed [FAILED] failed dialing endpoint (recovery), did not find expected responses... Tries 46 Command curl -g -q -s 'http://10.64.2.79:9080/dial?request=hostname&protocol=http&host=10.0.6.39&port=80&tries=1' retrieved map[bootstrap-e2e-minion-group-5dtl:{} netserver-1:{} netserver-2:{} netserver-3:{}] expected map[netserver-1:{} netserver-2:{} netserver-3:{}] In [It] at: k8s.io/kubernetes/test/e2e/network/networking.go:341 @ 03/06/24 01:32:57.492 }
Anything else we need to know?
Multiple failures observed intermittently in the Triage dashboard: https://storage.googleapis.com/k8s-triage/index.html?test=Services%20should%20update%20endpoints
Relevant SIG(s)
/sig network release