
[Flaking Test] [sig-network] Networking Granular Checks: Services should update endpoints: http (gce-ubuntu-master-containerd,master-blocking) #123760

Closed
SubhasmitaSw opened this issue Mar 6, 2024 · 50 comments
Assignees
Labels
kind/flake: Categorizes issue or PR as related to a flaky test.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/network: Categorizes an issue or PR as relevant to SIG Network.
sig/release: Categorizes an issue or PR as relevant to SIG Release.

Comments

@SubhasmitaSw
Contributor

SubhasmitaSw commented Mar 6, 2024

Which jobs are flaking?

gce-ubuntu-master-containerd

Prow: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1765185224909524992

Which tests are flaking?

Kubernetes e2e suite.[It] [sig-network] Networking Granular Checks: Services should update endpoints: http

Since when has it been flaking?

03-06 (intermittently)

Testgrid link

https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd

Reason for failure (if possible)

{ failed [FAILED] failed dialing endpoint (recovery), did not find expected responses... Tries 46 Command curl -g -q -s 'http://10.64.2.79:9080/dial?request=hostname&protocol=http&host=10.0.6.39&port=80&tries=1' retrieved map[bootstrap-e2e-minion-group-5dtl:{} netserver-1:{} netserver-2:{} netserver-3:{}] expected map[netserver-1:{} netserver-2:{} netserver-3:{}] In [It] at: k8s.io/kubernetes/test/e2e/network/networking.go:341 @ 03/06/24 01:32:57.492 }

Anything else we need to know?

Multiple failures observed intermittently in the Triage dashboard: https://storage.googleapis.com/k8s-triage/index.html?test=Services%20should%20update%20endpoints

Relevant SIG(s)

/sig network release

@SubhasmitaSw added the kind/flake label on Mar 6, 2024
@k8s-ci-robot added the sig/network, sig/release, and needs-triage labels on Mar 6, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pacoxu
Member

pacoxu commented Mar 6, 2024

/cc @aojea

@pacoxu
Member

pacoxu commented Mar 7, 2024

/retitle [Flaking Test] [sig-network] Networking Granular Checks: Services should update endpoints: http (gce-ubuntu-master-containerd,master-blocking)

@k8s-ci-robot changed the title from "[Flaking Test] gce-ubuntu-master-containerd on sig-release-master-blocking" to "[Flaking Test] [sig-network] Networking Granular Checks: Services should update endpoints: http (gce-ubuntu-master-containerd,master-blocking)" on Mar 7, 2024
@pacoxu
Member

pacoxu commented Mar 29, 2024

Kubernetes e2e suite: [It] [sig-network] Networking Granular Checks: Services should update endpoints: http (2m52s)
{ failed [FAILED] failed dialing endpoint (recovery), did not find expected responses... 
Tries 46
Command curl -g -q -s 'http://10.64.4.249:9080/dial?request=hostname&protocol=http&host=10.0.20.184&port=80&tries=1'
retrieved map[bootstrap-e2e-minion-group-38hl:{} netserver-1:{} netserver-2:{} netserver-3:{}]
expected map[netserver-1:{} netserver-2:{} netserver-3:{}]
In [It] at: k8s.io/kubernetes/test/e2e/network/networking.go:341 @ 03/28/24 09:37:59.542

ginkgo.It("should update endpoints: http", func(ctx context.Context) {
	config := e2enetwork.NewNetworkingTestConfig(ctx, f)
	ginkgo.By(fmt.Sprintf("dialing(http) %v --> %v:%v (config.clusterIP)", config.TestContainerPod.Name, config.ClusterIP, e2enetwork.ClusterHTTPPort))
	err := config.DialFromTestContainer(ctx, "http", config.ClusterIP, e2enetwork.ClusterHTTPPort, config.MaxTries, 0, config.EndpointHostnames())
	if err != nil {
		framework.Failf("failed dialing endpoint (initial), %v", err)
	}
	ginkgo.By("Deleting a pod which, will be replaced with a new endpoint")
	config.DeleteNetProxyPod(ctx)
	ginkgo.By(fmt.Sprintf("dialing(http) %v --> %v:%v (config.clusterIP) (endpoint recovery)", config.TestContainerPod.Name, config.ClusterIP, e2enetwork.ClusterHTTPPort))
	err = config.DialFromTestContainer(ctx, "http", config.ClusterIP, e2enetwork.ClusterHTTPPort, config.MaxTries, config.MaxTries, config.EndpointHostnames())
	if err != nil {
		framework.Failf("failed dialing endpoint (recovery), %v", err)
	}
})

  • retrieved map[bootstrap-e2e-minion-group-38hl:{} netserver-1:{} netserver-2:{} netserver-3:{}]
  • expected map[netserver-1:{} netserver-2:{} netserver-3:{}]
I0328 09:37:59.758398 10045 resource.go:168] POD                 NODE                             PHASE    GRACE  CONDITIONS
I0328 09:37:59.758490 10045 resource.go:175] netserver-1         bootstrap-e2e-minion-group-6w7q  Running         [{PodReadyToStartContainers True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:11 +0000 UTC  } {Initialized True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:10 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:30 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:30 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:10 +0000 UTC  }]
I0328 09:37:59.758526 10045 resource.go:175] netserver-2         bootstrap-e2e-minion-group-bx1l  Running         [{PodReadyToStartContainers True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:11 +0000 UTC  } {Initialized True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:10 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:30 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:30 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:10 +0000 UTC  }]
I0328 09:37:59.758558 10045 resource.go:175] netserver-3         bootstrap-e2e-minion-group-xntx  Running         [{PodReadyToStartContainers True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:11 +0000 UTC  } {Initialized True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:10 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:30 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:30 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:10 +0000 UTC  }]
I0328 09:37:59.758593 10045 resource.go:175] test-container-pod  bootstrap-e2e-minion-group-38hl  Running         [{PodReadyToStartContainers True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:32 +0000 UTC  } {Initialized True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:31 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:32 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:32 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2024-03-28 09:35:31 +0000 UTC  }]
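
For readers less familiar with this test: the curl above hits the /dial endpoint of the agnhost webserver in test-container-pod, which dials the service and reports back which backend hostnames answered; the framework accumulates those hostnames across polling attempts and finally compares the set against the expected endpoint hostnames, so a single stray response (here the node hostname) fails the whole run. A minimal sketch of that accumulate-and-compare logic, assuming the /dial reply is a JSON array of hostnames (illustrative only, not the actual e2e framework code):

package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// dialResponse models the assumed shape of a /dial reply: the hostnames
// that answered this single attempt.
type dialResponse struct {
	Responses []string `json:"responses"`
}

func main() {
	expected := map[string]struct{}{
		"netserver-1": {}, "netserver-2": {}, "netserver-3": {},
	}
	retrieved := map[string]struct{}{}

	// Two simulated attempts: the first races the endpoint update and is
	// answered by something reporting the node hostname.
	attempts := []string{
		`{"responses":["bootstrap-e2e-minion-group-5dtl"]}`,
		`{"responses":["netserver-1","netserver-2","netserver-3"]}`,
	}
	for _, raw := range attempts {
		var resp dialResponse
		if err := json.Unmarshal([]byte(raw), &resp); err != nil {
			continue
		}
		for _, h := range resp.Responses {
			retrieved[h] = struct{}{} // hostnames accumulate across attempts
		}
	}

	// The final check is set equality, so the stray node hostname from the
	// first attempt is enough to make the test fail.
	fmt.Println("retrieved:", retrieved)
	fmt.Println("expected: ", expected)
	fmt.Println("match:", reflect.DeepEqual(retrieved, expected))
}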

@aojea
Member

aojea commented Apr 3, 2024

other one here https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/124161/pull-kubernetes-e2e-gci-gce-ipvs/1775434062043811840

This looks worrisome, the fact that we get a different endpoint. I was checking the kube-proxy logs and I don't see anything suspicious, and it also fails with ipvs, which may indicate the problem is not in the implementation... could it be that the terminating pod answers with the host hostname during termination?

/priority important-soon

@k8s-ci-robot
Contributor

@aojea: The label(s) priority/important-soone cannot be applied, because the repository doesn't have them.

In response to this:

other one here https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/124161/pull-kubernetes-e2e-gci-gce-ipvs/1775434062043811840

This looks worrisome, the fact that we get a different endpoint. I was checking the kube-proxy logs and I don't see anything suspicious, and it also fails with ipvs, which may indicate the problem is not in the implementation... could it be that the terminating pod answers with the host hostname during termination?

/priority important-soone

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pacoxu
Member

pacoxu commented Apr 8, 2024

/priority important-soon

@k8s-ci-robot added the priority/important-soon label on Apr 8, 2024
@pacoxu
Member

pacoxu commented Apr 8, 2024

BTW, @aojea, is this a release cut blocker (the next cut is rc.2 this Thursday)?

If not, will this be targeted for v1.31?

@aojea
Member

aojea commented Apr 9, 2024

I'm trying to understand what is going on with this; it is really weird. It seems to only affect GCE jobs?

https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=id%20not%20find%20expected%20responses&test=Services%20should%20update%20endpoints

but there is still a high rate of failures

@aojea
Member

aojea commented Apr 10, 2024

Analyzing two occurrences, we can see that the problem happens on the pod that is being removed (netserver-0): it has a conflict on the endpoint update, and it picks up the hostname of the host where that pod was running.

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1776408965454761984

Command curl -g -q -s 'http://10.64.1.47:9080/dial?request=hostname&protocol=http&host=10.0.90.204&port=80&tries=1'
retrieved map[bootstrap-e2e-minion-group-4991:{} netserver-1:{} netserver-2:{}]
expected map[netserver-1:{} netserver-2:{}]
In [It] at: k8s.io/kubernetes/test/e2e/network

I0406 01:01:45.740732 10281 dump.go:53] At 2024-04-06 00:58:47 +0000 UTC - event for netserver-0: {default-scheduler } Scheduled: Successfully assigned nettest-757/netserver-0 to bootstrap-e2e-minion-group-4991

I0406 01:01:45.740856 10281 dump.go:53] At 2024-04-06 00:59:57 +0000 UTC - event for netserver-0: {kubelet bootstrap-e2e-minion-group-4991} Killing: Stopping container webserver
I0406 01:01:45.740862 10281 dump.go:53] At 2024-04-06 00:59:57 +0000 UTC - event for node-port-service: {endpoint-controller } FailedToUpdateEndpoint: Failed to update endpoint nettest-757/node-port-service: Operation cannot be fulfilled on endpoints "node-port-service": the object has been modified; please apply your changes to the latest version and try again
I0406 01:01:45.740869 10281 dump.go:53] At 2024-04-06 00:59:57 +0000 UTC - event for session-affinity-service: {endpoint-controller } FailedToUpdateEndpoint: Failed to update endpoint nettest-757/session-affinity-service: Operation cannot be fulfilled on endpoints "session-affinity-service": the object has been modified; please apply your changes to the latest version and try again

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1776347560789676032

Command curl -g -q -s 'http://10.64.1.239:9080/dial?request=hostname&protocol=http&host=10.0.128.203&port=80&tries=1'
retrieved map[bootstrap-e2e-minion-group-n25x:{} netserver-1:{} netserver-2:{}]
expected map[netserver-1:{} netserver-2:{}])
[FAILED] failed dialing endpoint (recovery), did not find expected responses... 

I0405 20:54:14.917406 8295 dump.go:53] At 2024-04-05 20:51:31 +0000 UTC - event for netserver-0: {default-scheduler } Scheduled: Successfully assigned nettest-4350/netserver-0 to bootstrap-e2e-minion-group-n25x


I0405 20:54:14.917552 8295 dump.go:53] At 2024-04-05 20:52:23 +0000 UTC - event for netserver-0: {kubelet bootstrap-e2e-minion-group-n25x} Killing: Stopping container webserver
I0405 20:54:14.917559 8295 dump.go:53] At 2024-04-05 20:52:23 +0000 UTC - event for node-port-service: {endpoint-controller } FailedToUpdateEndpoint: Failed to update endpoint nettest-4350/node-port-service: Operation cannot be fulfilled on endpoints "node-port-service": the object has been modified; please apply your changes to the latest version and try again
I0405 20:54:14.917568 8295 dump.go:53] At 2024-04-05 20:52:23 +0000 UTC - event for session-affinity-service: {endpoint-controller } FailedToUpdateEndpoint: Failed to update endpoint nettest-4350/session-affinity-service: Operation cannot be fulfilled on endpoints "session-affinity-service": the object has been modified; please apply your changes to the latest version and try again

Talked with @danwinship offline about this, there are 2 possible theories

  1. Something is listening on the host and kube-proxy redirects the traffic to it; maybe some tests are running in parallel and interfering?
  2. The existing pod, while it is terminating, replies with a different hostname.

ginkgo.It("should update endpoints: http", func(ctx context.Context) {
	config := e2enetwork.NewNetworkingTestConfig(ctx, f)
	ginkgo.By(fmt.Sprintf("dialing(http) %v --> %v:%v (config.clusterIP)", config.TestContainerPod.Name, config.ClusterIP, e2enetwork.ClusterHTTPPort))
	err := config.DialFromTestContainer(ctx, "http", config.ClusterIP, e2enetwork.ClusterHTTPPort, config.MaxTries, 0, config.EndpointHostnames())
	if err != nil {
		framework.Failf("failed dialing endpoint (initial), %v", err)
	}
	ginkgo.By("Deleting a pod which, will be replaced with a new endpoint")
	config.DeleteNetProxyPod(ctx)
	ginkgo.By(fmt.Sprintf("dialing(http) %v --> %v:%v (config.clusterIP) (endpoint recovery)", config.TestContainerPod.Name, config.ClusterIP, e2enetwork.ClusterHTTPPort))
	err = config.DialFromTestContainer(ctx, "http", config.ClusterIP, e2enetwork.ClusterHTTPPort, config.MaxTries, config.MaxTries, config.EndpointHostnames())
	if err != nil {
		framework.Failf("failed dialing endpoint (recovery), %v", err)
	}
})

These are the kube-proxy logs at the time things happen:

I0405 20:52:11.811605       1 servicechangetracker.go:211] "Adding new service port" portName="nettest-4350/session-affinity-service:http" servicePort="10.0.150.200:80/TCP"
I0405 20:52:11.811682       1 servicechangetracker.go:211] "Adding new service port" portName="nettest-4350/session-affinity-service:udp" servicePort="10.0.150.200:90/UDP"
I0405 20:52:11.811708       1 servicechangetracker.go:235] "Removing service port" portName="services-6194/affinity-nodeport"
I0405 20:52:11.811724       1 servicechangetracker.go:211] "Adding new service port" portName="services-4150/nodeport-update-service:tcp-port" servicePort="10.0.44.156:80/TCP"
I0405 20:52:11.811742       1 servicechangetracker.go:211] "Adding new service port" portName="nettest-4350/node-port-service:http" servicePort="10.0.128.203:80/TCP"
I0405 20:52:11.811756       1 servicechangetracker.go:211] "Adding new service port" portName="nettest-4350/node-port-service:udp" servicePort="10.0.128.203:90/UDP"
I0405 20:52:11.827862       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/node-port-service:udp" endpoints=["10.64.1.231:8081","10.64.2.207:8081","10.64.3.142:8081"]
I0405 20:52:11.827912       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/node-port-service:http" endpoints=["10.64.1.231:8083","10.64.2.207:8083","10.64.3.142:8083"]
I0405 20:52:11.827984       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/session-affinity-service:udp" endpoints=["10.64.1.231:8081","10.64.2.207:8081","10.64.3.142:8081"]
I0405 20:52:11.828000       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/session-affinity-service:http" endpoints=["10.64.1.231:8083","10.64.2.207:8083","10.64.3.142:8083"]


I0405 20:52:31.591636       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/node-port-service:http" endpoints=["10.64.1.231:8083","10.64.2.207:8083","10.64.3.142:8083"]
I0405 20:52:31.591656       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/node-port-service:udp" endpoints=["10.64.1.231:8081","10.64.2.207:8081","10.64.3.142:8081"]
I0405 20:52:31.591695       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/node-port-service:udp" endpoints=["10.64.1.231:8081","10.64.3.142:8081"]
I0405 20:52:31.591708       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/node-port-service:http" endpoints=["10.64.1.231:8083","10.64.3.142:8083"]
I0405 20:52:31.591764       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/session-affinity-service:udp" endpoints=["10.64.1.231:8081","10.64.2.207:8081","10.64.3.142:8081"]
I0405 20:52:31.591779       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/session-affinity-service:http" endpoints=["10.64.1.231:8083","10.64.2.207:8083","10.64.3.142:8083"]
I0405 20:52:31.591814       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/session-affinity-service:udp" endpoints=["10.64.1.231:8081","10.64.3.142:8081"]
I0405 20:52:31.591828       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/session-affinity-service:http" endpoints=["10.64.1.231:8083","10.64.3.142:8083"]
I0405 20:52:31.591897       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="services-2353/externalip-test:http" endpoints=["10.64.1.7:9376","10.64.2.235:9376"]
I0405 20:52:31.591927       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="services-5061/up-down-1" endpoints=["10.64.1.12:9376","10.64.2.244:9376"]
I0405 20:52:31.591957       1 endpointschangetracker.go:317] "Deleted endpoint may have stale conntrack entries" portName="nettest-4350/session-affinity-service:udp" endpoint="10.64.2.207:8081"
I0405 20:52:31.591986       1 endpointschangetracker.go:317] "Deleted endpoint may have stale conntrack entries" portName="nettest-4350/node-port-service:udp" endpoint="10.64.2.207:8081"

@aojea
Member

aojea commented Apr 10, 2024

hmm, the Pod is deleted with NewDeleteOptions(0)

func (config *NetworkingTestConfig) DeleteNetProxyPod(ctx context.Context) {
	pod := config.EndpointPods[0]
	framework.ExpectNoError(config.getPodClient().Delete(ctx, pod.Name, *metav1.NewDeleteOptions(0)))
	config.EndpointPods = config.EndpointPods[1:]
	// wait for pod being deleted.
	err := e2epod.WaitForPodNotFoundInNamespace(ctx, config.f.ClientSet, pod.Name, config.Namespace, wait.ForeverTestTimeout)
	if err != nil {
		framework.Failf("Failed to delete %s pod: %v", pod.Name, err)
	}
	// wait for endpoint being removed.
	err = framework.WaitForServiceEndpointsNum(ctx, config.f.ClientSet, config.Namespace, nodePortServiceName, len(config.EndpointPods), time.Second, wait.ForeverTestTimeout)
	if err != nil {
		framework.Failf("Failed to remove endpoint from service: %s", nodePortServiceName)
	}
	// wait for kube-proxy to catch up with the pod being deleted.
	time.Sleep(5 * time.Second)
}
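
For reference, metav1.NewDeleteOptions(0) just requests a zero grace period, so the kubelet is told to kill the webserver container immediately instead of giving it the default 30s to drain. A minimal sketch of the equivalent explicit call, assuming client-go's typed pod client (illustrative, not framework code):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

// deleteImmediately is equivalent to passing *metav1.NewDeleteOptions(0):
// it deletes the pod with GracePeriodSeconds set to 0, so termination does
// not wait for the container to shut down gracefully.
func deleteImmediately(ctx context.Context, pods corev1client.PodInterface, name string) error {
	gracePeriod := int64(0)
	return pods.Delete(ctx, name, metav1.DeleteOptions{GracePeriodSeconds: &gracePeriod})
}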

@aojea
Member

aojea commented Apr 10, 2024

Still digging into https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1776347560789676032

first query after deletion happens at 20:52:29.935143

I0405 20:52:24.805340 8295 util.go:427] Waiting for amount of service:node-port-service endpoints to be 2
STEP: dialing(http) test-container-pod --> 10.0.128.203:80 (config.clusterIP) (endpoint recovery) - k8s.io/kubernetes/test/e2e/network/networking.go:338 @ 04/05/24 20:52:29.889
I0405 20:52:29.935143 8295 exec_util.go:55] ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.64.1.239:9080/dial?request=hostname&protocol=http&host=10.0.128.203&port=80&tries=1'] Namespace:nettest-4350 PodName:test-container-pod ContainerName:webserver Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false Quiet:false}
I0405 20:52:29.935165 8295 util.go:506] >>> kubeConfig: /workspace/.kube/config

kube-proxy does not update the rules until 20:52:31.591636

I0405 20:52:31.591636       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/node-port-service:http" endpoints=["10.64.1.231:8083","10.64.2.207:8083","10.64.3.142:8083"]
I0405 20:52:31.591656       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/node-port-service:udp" endpoints=["10.64.1.231:8081","10.64.2.207:8081","10.64.3.142:8081"]
I0405 20:52:31.591695       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/node-port-service:udp" endpoints=["10.64.1.231:8081","10.64.3.142:8081"]
I0405 20:52:31.591708       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/node-port-service:http" endpoints=["10.64.1.231:8083","10.64.3.142:8083"]
I0405 20:52:31.591764       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/session-affinity-service:udp" endpoints=["10.64.1.231:8081","10.64.2.207:8081","10.64.3.142:8081"]
I0405 20:52:31.591779       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/session-affinity-service:http" endpoints=["10.64.1.231:8083","10.64.2.207:8083","10.64.3.142:8083"]
I0405 20:52:31.591814       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/session-affinity-service:udp" endpoints=["10.64.1.231:8081","10.64.3.142:8081"]
I0405 20:52:31.591828       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-4350/session-affinity-service:http" endpoints=["10.64.1.231:8083","10.64.3.142:8083"]
I0405 20:52:31.591897       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="services-2353/externalip-test:http" endpoints=["10.64.1.7:9376","10.64.2.235:9376"]
I0405 20:52:31.591927       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="services-5061/up-down-1" endpoints=["10.64.1.12:9376","10.64.2.244:9376"]
I0405 20:52:31.591957       1 endpointschangetracker.go:317] "Deleted endpoint may have stale conntrack entries" portName="nettest-4350/session-affinity-service:udp" endpoint="10.64.2.207:8081"
I0405 20:52:31.591986       1 endpointschangetracker.go:317] "Deleted endpoint may have stale conntrack entries" portName="nettest-4350/node-port-service:udp" endpoint="10.64.2.207:8081"

so it is possible the first request lands on the pod being deleted; the question is, can the process get a different hostname?

containerd logs to check whether something happens at 20:52:29.935143: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1776347560789676032/artifacts/bootstrap-e2e-minion-group-n25x/containerd.log, cc: @smarterclayton

@aojea
Member

aojea commented Apr 11, 2024

hmmm @pacoxu this only happens with master containerd

Apr 05 20:40:07.615279 bootstrap-e2e-minion-group-n25x systemd[1]: Starting containerd container runtime...
Apr 05 20:40:07.946826 bootstrap-e2e-minion-group-n25x systemd[1]: Started containerd container runtime.
Apr 05 20:40:07.969997 bootstrap-e2e-minion-group-n25x containerd[1762]: time="2024-04-05T20:40:07.969850806Z" level=info msg="starting containerd" revision=d0cdb23fd29354a241c840ba91250e70617764ad version=v2.0.0-rc.0-41-gd0cdb23fd

and it stopped happening in the last 4 days

https://storage.googleapis.com/k8s-triage/index.html?text=minion&test=Services%20should%20update%20endpoints

maybe related to this containerd/containerd#9999

Testgrid also does not show any recent failures in the last 5 days: https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd&include-filter-by-regex=should%20update%20endpoints&width=20

I'm going to close as:

  • it happens only on jobs using containerd master (no failures in kind jobs)
  • it didn't happen in the last 5 days (since the new release and a critical fix in containerd)
  • it does seem to be a problem with containerd (kube-proxy and kubelet look fine)

/close

please reopen if it happens again. IMPORTANT: the failure has to include the node hostname in the map (bootstrap-e2e-minion-group-5dtl in this case)

retrieved map[bootstrap-e2e-minion-group-5dtl:{} netserver-1:{} netserver-2:{} netserver-3:{}]

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

hmmm @pacoxu this only happens with master containerd

Apr 05 20:40:07.615279 bootstrap-e2e-minion-group-n25x systemd[1]: Starting containerd container runtime...
Apr 05 20:40:07.946826 bootstrap-e2e-minion-group-n25x systemd[1]: Started containerd container runtime.
Apr 05 20:40:07.969997 bootstrap-e2e-minion-group-n25x containerd[1762]: time="2024-04-05T20:40:07.969850806Z" level=info msg="starting containerd" revision=d0cdb23fd29354a241c840ba91250e70617764ad version=v2.0.0-rc.0-41-gd0cdb23fd

and it stopped happening in the last 4 days

https://storage.googleapis.com/k8s-triage/index.html?text=minion&test=Services%20should%20update%20endpoints

maybe related to this containerd/containerd#9999

Testgrid also does not show any recent failures in the last 5 days: https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd&include-filter-by-regex=should%20update%20endpoints&width=20

I'm going to close as:

  • it happens only on jobs using containerd master (no failures in kind jobs)
  • it didn't happen in the last 5 days (since the new release and a critical fix in containerd)
  • it does seem to be a problem with containerd (kube-proxy and kubelet look fine)

/close

please reopen if it happens again. IMPORTANT: the failure has to include the node hostname in the map (bootstrap-e2e-minion-group-5dtl in this case)

retrieved map[bootstrap-e2e-minion-group-5dtl:{} netserver-1:{} netserver-2:{} netserver-3:{}]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@aojea: Reopened this issue.

In response to this:

/reopen

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/124424/pull-kubernetes-e2e-gci-gce-ipvs/1781739030539407360

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this on Apr 21, 2024
@aojea
Member

aojea commented Apr 21, 2024

/assign @aojea

This is really weird; maybe there is something more worrisome behind it.

@aroradaman
Member

aroradaman commented Apr 26, 2024

@aojea Is there any reason why we use config.MaxTries instead of 0 for the minTries parameter of DialFromTestContainer? I tried git blame, but couldn't find any associated history.

err = config.DialFromTestContainer(ctx, "http", config.ClusterIP, e2enetwork.ClusterHTTPPort, config.MaxTries, config.MaxTries, config.EndpointHostnames())

After deleting the pod/endpoint the condition is met within 8 seconds, but we wait for 3 minutes and keep polling hostnames. If the test can fail due to interference or any race condition, the chances of that are increased by using config.MaxTries for the minTries parameter.

Why the node hostname is retrieved is a different story, I'm still looking into it.
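
To make that concrete, here is a rough sketch of the kind of minTries/maxTries dial loop being discussed (a simplification, not the actual DialFromTestContainer code): with minTries equal to maxTries the loop effectively never returns early once the expected set has been seen, so it keeps polling for the full budget and any stray response collected along the way ends up in the final comparison.

// dialUntil models a minTries/maxTries polling loop: each attempt returns the
// hostnames that answered, which are merged into a running set; success is
// only possible once at least minTries attempts have been made and the
// running set equals the expected set.
func dialUntil(attempt func() []string, expected map[string]struct{}, minTries, maxTries int) (map[string]struct{}, bool) {
	got := map[string]struct{}{}
	for i := 0; i < maxTries; i++ {
		for _, h := range attempt() {
			got[h] = struct{}{}
		}
		// With minTries == maxTries this early exit only fires on the very
		// last iteration, so the loop polls for the whole budget.
		if i+1 >= minTries && setsEqual(got, expected) {
			return got, true
		}
	}
	return got, setsEqual(got, expected)
}

func setsEqual(a, b map[string]struct{}) bool {
	if len(a) != len(b) {
		return false
	}
	for k := range a {
		if _, ok := b[k]; !ok {
			return false
		}
	}
	return true
}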

@aroradaman
Member

aroradaman commented Apr 26, 2024

@aroradaman
Member

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1785325927534694400
This job failed with some debug logs which were added in https://github.com/kubernetes/kubernetes/pull/124583/files.

The unexpected host hostname is received in the first attempt to get the hostnames; the first attempt happens immediately after deleting the pod.

https://35.203.136.98/api/v1/namespaces/nettest-5038/pods/test-container-pod/exec?command=%2Fbin%2Fsh&command=-c&command=curl+-g+-q+-s+%27http%3A%2F%2F10.64.3.217%3A9080%2Fdial%3Frequest%3Dhostname%26protocol%3Dhttp%26host%3D10.0.98.211%26port%3D80%26tries%3D1%27&container=webserver&container=webserver&stderr=true&stdout=true
  I0430 15:25:39.690698 10163 utils.go:330] encountered error during dial (received unexpected responses... 
  Attempt 0
  Command curl -g -q -s 'http://10.64.3.217:9080/dial?request=hostname&protocol=http&host=10.0.98.211&port=80&tries=1'
  retrieved map[bootstrap-e2e-minion-group-5szn:{}]
  expected map[netserver-1:{} netserver-2:{}])
  [FAILED] in [It] - k8s.io/kubernetes/test/e2e/network/networking.go:341 @ 04/30/24 15:25:39.69
 

@aroradaman
Member

@aojea
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/124660/pull-kubernetes-e2e-gci-gce-ipvs/1785847254955855872
I tried to capture events of the failed job:

containerd logs, container was removed at 02:25:32.903349

May 02 02:25:32.877139 bootstrap-e2e-minion-group-6qdg containerd[4288]: time="2024-05-02T02:25:32.877075742Z" level=info msg="RemovePodSandbox for \"947979eab98cdbaa30a23dcbac470fee2603d26f802be6eabc3c94ddd22adf13\""
May 02 02:25:32.877317 bootstrap-e2e-minion-group-6qdg containerd[4288]: time="2024-05-02T02:25:32.877124474Z" level=info msg="Forcibly stopping sandbox \"947979eab98cdbaa30a23dcbac470fee2603d26f802be6eabc3c94ddd22adf13\""
May 02 02:25:32.891977 bootstrap-e2e-minion-group-6qdg containerd[4288]: time="2024-05-02T02:25:32.891876911Z" level=info msg="TearDown network for sandbox \"947979eab98cdbaa30a23dcbac470fee2603d26f802be6eabc3c94ddd22adf13\" successfully"
May 02 02:25:32.903087 bootstrap-e2e-minion-group-6qdg containerd[4288]: time="2024-05-02T02:25:32.902997355Z" level=warning msg="Failed to get podSandbox status for container event for sandboxID \"947979eab98cdbaa30a23dcbac470fee2603d26f802be6eabc3c94ddd22adf13\": an error occurred when try to find sandbox: not found. Sending the event with nil podSandboxStatus."

kube-proxy logs, endpointslice cache was updated at 02:24:43.497430 and proxy rules were updated at 02:24:43.771588

I0502 02:24:43.497430       1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-2332/node-port-service:udp" endpoints=["10.64.1.42:8081","10.64.2.37:8081","10.64.3.80:8081"]
<sync proxy rules complete>
I0502 02:24:43.771588       1 proxier.go:929] "syncProxyRules complete" ipFamily="IPv4" elapsed="274.539498ms"

e2e logs: the curl request which received the unexpected hostname was made at 02:24:41.144381, roughly 2.5 seconds before the proxy rules were updated.

I0502 02:24:41.144381 65737 exec_util.go:59] ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.64.4.44:9080/dial?request=hostname&protocol=http&host=10.0.17.151&port=80&tries=1'] Namespace:nettest-2332 PodName:test-container-pod ContainerName:webserver Stdin: CaptureStdout:true CaptureStderr:true PreserveWhitespace:false Quiet:false}

netserver-0 pod logs

I0502 02:24:33.275000 65737 networking.go:356] ================ start of pod log for netserver-0 ================
  I0502 02:24:33.275025 65737 networking.go:357] 
  		I0502 02:23:47.400180       1 log.go:245] Started HTTP server on port 8083
  		I0502 02:23:47.400962       1 log.go:245] Started UDP server on port  8081
  		I0502 02:24:06.161598       1 log.go:245] GET /healthz
  		I0502 02:24:06.161646       1 log.go:245] GET /healthz
  		I0502 02:24:16.175435       1 log.go:245] GET /healthz
  		I0502 02:24:16.175902       1 log.go:245] GET /healthz
  		I0502 02:24:25.329948       1 log.go:245] GET /hostname
  		I0502 02:24:25.329998       1 log.go:245] hostname: netserver-0
  		I0502 02:24:26.159759       1 log.go:245] GET /healthz
  		I0502 02:24:26.161205       1 log.go:245] GET /healthz
  I0502 02:24:33.275031 65737 networking.go:358] ================ end of pod log for netserver-0 ================

The container and container network were removed immediately; I think there is some other pod/service that replies with the host hostname, resulting in the failure.

@aojea
Member

aojea commented May 5, 2024

I think there is some other pod/service that replies with the host hostname resulting in the failure.

Ok, things that we know: the request is forwarded by kube-proxy to the same endpoint (the same PodIP that netserver-0 had).

The process that replies to the request is listening on the PodIP of netserver-0. Was it in the same network namespace, or was a new namespace created with a veth holding that PodIP? I checked the containerd logs and I didn't see that IP get reallocated or another pod created during that time frame; that PodIP is released when the Pod is removed, so (we need to verify this in the containerd logs) that IP should not be allocated to another Pod.

@danwinship
Contributor

Ok, things that we know, the request is forwarded from kube-proxy to the same endpoint (same PodIP netserver-0 has)

where do we know that from?

that IP should not be allocated to other Pod

but earlier:

I'm trying to understand what is going on with this; it is really weird. It seems to only affect GCE jobs?

How exactly do pod IPs work on GCE? Is it possible that immediately after a pod gets deleted, traffic to that pod's IP could accidentally get delivered somewhere else? In most circumstances that might not get noticed if the pods aren't listening on the same ports, but all of the e2enetwork.NewNetworkingTestConfig()-based tests use the same ports...

(What other tests were running concurrently with the failed test?)

@uablrek
Contributor

uablrek commented May 5, 2024

Since the returned hostname is from the node, my bet is on a server running with hostNetwork:true. That could well happen with the race described in #121245, but that assumes that the service IP exists on some interface on the node. It does for proxy-mode=ipvs on kube-ipvs0, but what if there are other cases, or a kube-ipvs0 from an older run lingers?

@danwinship
Contributor

The service IP shouldn't exist on any interface if we aren't using ipvs, but it's more plausible that the pod IP could. Maybe when the pod netns is destroyed, its veth ends up in the host netns briefly before being destroyed? (Though that's explicitly not supposed to happen for virtual interface types.)

@danwinship
Contributor

the curl request which received unexpected hostname was made at 02:24:41.144381

That was during the run of "[It] should function for service endpoints using hostNetwork", which uses e2enetwork.NewNetworkingTestConfig(ctx, f, e2enetwork.UseHostNetwork) and thus would be answering with the node hostname on all interfaces in the host netns. The plot thickens...

> Enter [It] should function for service endpoints using hostNetwork - k8s.io/kubernetes/test/e2e/network/networking.go:506 @ 05/02/24 02:23:54.492
...
STEP: Creating the service pods in kubernetes - k8s.io/kubernetes/test/e2e/framework/network/utils.go:770 @ 05/02/24 02:23:54.492
...
I0502 02:24:42.961552 65733 exec_util.go:59] ExecWithOptions {Command:[/bin/sh -c curl -g -q -s 'http://10.64.3.117:9080/dial?request=hostname&protocol=http&host=10.40.0.6&port=32692&tries=1'] Namespace:nettest-1937 PodName:test-container-pod ContainerName:webserver Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false Quiet:false}
...
STEP: Destroying namespace "nettest-1937" for this suite. - k8s.io/kubernetes/test/e2e/framework/framework.go:360 @ 05/02/24 02:25:42.864
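
That would explain the node name showing up in the responses: the /hostname handler in the netserver image just reports the hostname of whatever UTS namespace it runs in, and for a hostNetwork pod that is the node's hostname. A minimal sketch of that assumed handler behavior (not the actual agnhost source):

import (
	"fmt"
	"net/http"
	"os"
)

// hostnameHandler sketches what a /hostname endpoint does: it returns
// os.Hostname(). In a regular pod that is the pod name (e.g. netserver-1);
// in a hostNetwork pod it is the node name (e.g. bootstrap-e2e-minion-group-5dtl).
func hostnameHandler(w http.ResponseWriter, r *http.Request) {
	name, err := os.Hostname()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Fprint(w, name)
}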

@aojea
Member

aojea commented May 5, 2024

Ok, things that we know, the request is forwarded from kube-proxy to the same endpoint (same PodIP netserver-0 has)

where do we know that from?

the request goes to the ClusterIP, and the endpoints are the same since they didn't change, hence the request has to be sent to the same PodIP, no?

Daman's example is with IPVS; let's see one with iptables: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1786906482093068288

Bad request at 00:05:29.9824

I0505 00:05:29.982400 8275 exec_util.go:83] ExecWithOptions: execute(POST https://35.247.28.38/api/v1/namespaces/nettest-8396/pods/test-container-pod/exec?command=%2Fbin%2Fsh&command=-c&command=curl+-g+-q+-s+%27http%3A%2F%2F10.64.2.204%3A9080%2Fdial%3Frequest%3Dhostname%26protocol%3Dhttp%26host%3D10.0.225.118%26port%3D80%26tries%3D1%27&container=webserver&container=webserver&stderr=true&stdout=true)
I0505 00:05:31.229222 8275 utils.go:330] encountered error during dial (received unexpected responses...

The request is made from test-container-pod ({kubelet bootstrap-e2e-minion-group-pmh5}), so those are the kube-proxy logs we are interested in, as the iptables rules will rewrite the cluster IP 10.0.225.118 to the endpoint IP on that host.

STEP: dialing(http) test-container-pod --> 10.0.225.118:80 (config.clusterIP) - k8s.io/kubernetes/test/e2e/network/networking.go:362 @ 05/05/24 00:05:10.649

The endpoints for the service are programmed at

I0505 00:05:10.649071 1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-8396/node-port-service:http" endpoints=["10.64.1.166:8083","10.64.2.176:8083","10.64.3.174:8083"]

and updated with the endpoint removal at 00:05:30

I0505 00:05:30.991821 1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-8396/node-port-service:udp" endpoints=["10.64.1.166:8081","10.64.2.176:8081","10.64.3.174:8081"]
I0505 00:05:30.991855 1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-8396/node-port-service:http" endpoints=["10.64.1.166:8083","10.64.2.176:8083","10.64.3.174:8083"]
I0505 00:05:30.992143 1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-8396/node-port-service:udp" endpoints=["10.64.1.166:8081","10.64.2.176:8081"]
I0505 00:05:30.992168 1 endpointslicecache.go:348] "Setting endpoints for service port name" portName="nettest-8396/node-port-service:http" endpoints=["10.64.1.166:8083","10.64.2.176:8083"]
...
I0505 00:05:30.992565 1 endpointschangetracker.go:317] "Deleted endpoint may have stale conntrack entries" portName="nettest-8396/node-port-service:udp" endpoint="10.64.3.174:8081"

The IP removed was 10.64.3.174, so netserver-0 should have had that IP; we need to validate that in the kubelet or containerd logs. The containerd logs here are from the node where the deleted Pod was running.

I0505 00:05:31.329760 8275 dump.go:53] At 2024-05-05 00:05:21 +0000 UTC - event for netserver-0: {kubelet bootstrap-e2e-minion-group-l2p5} Killing: Stopping container webserver

10.64.3.174 is assigned twice, but the second time is at 00:11:01.250407

May 05 00:04:41.674088 bootstrap-e2e-minion-group-l2p5 containerd[3559]: time="2024-05-05T00:04:41.672530158Z" level=debug msg="cni result: {"Interfaces":{"eth0":{"IPConfigs":[{"IP":"10.64.3.174","Gateway":"10.64.3.1"}],"Mac":"22:e1:8a:51:f7:41","Sandbox":"/var/run/netns/cni-176704ce-8f15-5bf1-58dc-319a5597af60"},"lo":{"IPConfigs":[{"IP":"127.0.0.1","Gateway":""},{"IP":"::1","Gateway":""}],"Mac":"00:00:00:00:00:00","Sandbox":"/var/run/netns/cni-176704ce-8f15-5bf1-58dc-319a5597af60"},"veth718e1f30":{"IPConfigs":null,"Mac":"ba:4a:cb:1d:46:73","Sandbox":""}},"DNS":[{},{}],"Routes":[{"dst":"0.0.0.0/0"}]}" podsandboxid=51a122a6746dff47eb22db55e355a468e89118d22e280cb87f1606f738a18370

sandbox 51a122a6746dff47eb22db55e355a468e89118d22e280cb87f1606f738a18370 is io.kubernetes.cri.sandbox-namespace:nettest-8396 and io.kubernetes.cri.sandbox-name:netserver-0

May 05 00:11:01.250407 bootstrap-e2e-minion-group-l2p5 containerd[3559]: time="2024-05-05T00:11:01.250353481Z" level=debug msg="cni result: {"Interfaces":{"eth0":{"IPConfigs":[{"IP":"10.64.3.174","Gateway":"10.64.3.1"}],"Mac":"f2:1e:21:9e:d4:cc","Sandbox":"/var/run/netns/cni-441c001f-0c58-328e-0412-5f0641b47747"},"lo":{"IPConfigs":[{"IP":"127.0.0.1","Gateway":""},{"IP":"::1","Gateway":""}],"Mac":"00:00:00:00:00:00","Sandbox":"/var/run/netns/cni-441c001f-0c58-328e-0412-5f0641b47747"},"veth2e98844a":{"IPConfigs":null,"Mac":"f6:fb:0b:06:82:28","Sandbox":""}},"DNS":[{},{}],"Routes":[{"dst":"0.0.0.0/0"}]}" podsandboxid=1552fe4404c5a80b15edb1d9f6e427476bcc5ea49421c150886be240c7ec6dcf

is io.kubernetes.cri.sandbox-name:pod2 and io.kubernetes.cri.sandbox-namespace:services-7021

The pod sandbox is removed at 00:05:22.031109

May 05 00:05:22.031109 bootstrap-e2e-minion-group-l2p5 containerd[3559]: time="2024-05-05T00:05:22.031041586Z" level=info msg="TearDown network for sandbox "51a122a6746dff47eb22db55e355a468e89118d22e280cb87f1606f738a18370" successfully"
May 05 00:05:22.031461 bootstrap-e2e-minion-group-l2p5 containerd[3559]: time="2024-05-05T00:05:22.031366503Z" level=info msg="StopPodSandbox for "51a122a6746dff47eb22db55e355a468e89118d22e280cb87f1606f738a18370" returns successfully"

that seems to match the pod logs

  I0505 00:05:21.510479       1 log.go:245] hostname: netserver-0

I0505 00:05:22.533916 8275 networking.go:358] ================ end of pod log for netserver-0 ================

however, there are still more events in containerd removing the mentioned sandbox, which sounds weird,

May 05 00:05:52.185719 bootstrap-e2e-minion-group-l2p5 containerd[3559]: time="2024-05-05T00:05:52.185652738Z" level=info msg="TearDown network for sandbox "51a122a6746dff47eb22db55e355a468e89118d22e280cb87f1606f738a18370" successfully"
May 05 00:05:52.185719 bootstrap-e2e-minion-group-l2p5 containerd[3559]: time="2024-05-05T00:05:52.185715741Z" level=info msg="StopPodSandbox for "51a122a6746dff47eb22db55e355a468e89118d22e280cb87f1606f738a18370" returns successfully"
May 05 00:05:52.187731 bootstrap-e2e-minion-group-l2p5 containerd[3559]: time="2024-05-05T00:05:52.187681351Z" level=info msg="RemovePodSandbox for "51a122a6746dff47eb22db55e355a468e89118d22e280cb87f1606f738a18370""
May 05 00:05:52.187890 bootstrap-e2e-minion-group-l2p5 containerd[3559]: time="2024-05-05T00:05:52.187744270Z" level=info msg="Forcibly stopping sandbox "51a122a6746dff47eb22db55e355a468e89118d22e280cb87f1606f738a18370""

and the container is not completely removed until later; could it be that the CNI DEL didn't work and the Pod was still alive until 00:05:52.1877?

The kubelet considers the pod killed at 00:05:23.218742

May 05 00:05:23.218742 bootstrap-e2e-minion-group-l2p5 kubelet[4178]: I0505 00:05:23.218673 4178 pod_workers.go:840] "Pod is finished processing, no further updates" pod="nettest-8396/netserver-0" podUID="4bc1c6ce-de7c-4134-b2f8-70c670646de7" updateType="kill"

The last message from that pod uid 4bc1c6ce-de7c-4134-b2f8-70c670646de7 is at

May 05 00:05:54.127808 bootstrap-e2e-minion-group-l2p5 kubelet[4178]: I0505 00:05:54.124815 4178 kuberuntime_gc.go:343] "Removing pod logs" podUID="4bc1c6ce-de7c-4134-b2f8-70c670646de7"

GCE jobs set up the network using the containerd CNI config template

May 04 23:59:13.574993 bootstrap-e2e-minion-group-l2p5 containerd[3559]: time="2024-05-04T23:59:13.574755588Z" level=info msg="Generating cni config from template "/home/containerd/opt/containerd/cluster/gce/cni.template""

which uses the ptp plugin:

local cni_template_path="${KUBE_HOME}/cni.template"
cat > "${cni_template_path}" <<EOF
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "ptp",
      "mtu": 1460,
      "ipam": {
        "type": "host-local",
        "subnet": "{{.PodCIDR}}",
        "routes": [
          {
            "dst": "0.0.0.0/0"
          }
        ]
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
EOF

The ptp plugin installs a route to the PodIP through the associated veth; an example from another environment:

10.244.1.70 dev veth53f7eefc scope host

@danwinship's theory is also possible; the hostNetwork test runs at that time:

   "LeafNodeText": "should function for service endpoints using hostNetwork",
    "State": "passed",
    "StartTime": "2024-05-05T00:03:29.838222603Z",
    "EndTime": "2024-05-05T00:05:57.899458908Z",

EDIT:

or both things are happening: the containerd network teardown does not happen at that time, so the PodIP is still reachable somehow and the hostNetwork pod is answering that request

@aroradaman
Member

aroradaman commented May 6, 2024

Should we allow overriding the listen address in agnhost via environment variables?
We could tweak e2enetwork.NewNetworkingTestConfig to use the downward API and pass PodIPs.

or both things are happening: the containerd network teardown does not happen at that time, so the PodIP is still reachable somehow and the hostNetwork pod is answering that request

This would probably verify the latter.
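
If we went the downward-API route, the pod spec change would look roughly like this (a sketch using core/v1 types; the POD_IP variable name is made up for this example and is not something the framework defines today):

import v1 "k8s.io/api/core/v1"

// podIPEnv exposes the pod's own IP to the container via the downward API,
// so agnhost could in principle bind only to that address instead of 0.0.0.0.
// "POD_IP" is a hypothetical name chosen for this illustration.
func podIPEnv() []v1.EnvVar {
	return []v1.EnvVar{{
		Name: "POD_IP",
		ValueFrom: &v1.EnvVarSource{
			FieldRef: &v1.ObjectFieldSelector{FieldPath: "status.podIP"},
		},
	}}
}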

@danwinship
Contributor

agnhost can already do whatever you want; the problem is that this set of tests always just passes the same ports.

we could tweak things to confirm that it makes the flake go away, but that doesn't actually fix the bug; something is going wrong that makes the flake even possible

jingczhang pushed a commit to nokia/kubernetes that referenced this issue May 7, 2024
Intermittent logging to understand root cause of
kubernetes#123760

Signed-off-by: Daman Arora <aroradaman@gmail.com>
@aojea
Member

aojea commented May 9, 2024

we could tweak things to confirm that it makes the flake go away, but that doesn't actually fix the bug; something is going wrong that makes the flake even possible

Reinforcing Dan's words: the goal of the tests is not just to pass and be green, it is to find issues so we can solve them...

@aojea
Member

aojea commented May 12, 2024

Ok, we got a hit https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1788466036081692672

Node for the Pod

I0509 07:30:08.231436 10011 dump.go:53] At 2024-05-09 07:29:11 +0000 UTC - event for netserver-0: {default-scheduler } Scheduled: Successfully assigned nettest-2035/netserver-0 to bootstrap-e2e-minion-group-1qj3

Service and endpoints

I0509 07:29:55.360397 1 endpointslicecache.go:302] "Setting endpoints for service port name" portName="nettest-2035/node-port-service:http" endpoints=["10.64.1.229:8083","10.64.2.184:8083","10.64.3.199:8083","10.64.4.186:8083"]
I0509 07:29:55.360443 1 endpointslicecache.go:302] "Setting endpoints for service port name" portName="nettest-2035/node-port-service:udp" endpoints=["10.64.1.229:8081","10.64.2.184:8081","10.64.3.199:8081"]
I0509 07:29:55.360457 1 endpointslicecache.go:302] "Setting endpoints for service port name" portName="nettest-2035/node-port-service:http" endpoints=["10.64.1.229:8083","10.64.2.184:8083","10.64.3.199:8083"]

PodIP https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1788466036081692672/artifacts/bootstrap-e2e-minion-group-1qj3/kubelet.log

May 09 07:29:22.453739 bootstrap-e2e-minion-group-1qj3 kubelet[8868]: I0509 07:29:22.453463 8868 status_manager.go:883] "Status for pod updated successfully" pod="nettest-2035/netserver-0" statusVersion=3 status={"phase":"Running","conditions":[{"type":"PodReadyToStartContainers","status":"True","lastProbeTime":null,"lastTransitionTime":"2024-05-09T07:29:13Z"},{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2024-05-09T07:29:11Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2024-05-09T07:29:22Z"},{"type":"ContainersReady","status":"True","lastProbeTime":null,"lastTransitionTime":"2024-05-09T07:29:22Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2024-05-09T07:29:11Z"}],"hostIP":"10.40.0.6","hostIPs":[{"ip":"10.40.0.6"}],"podIP":"10.64.4.186","podIPs":[{"ip":"10.64.4.186"}],"startTime":"2024-05-09T07:29:11Z","containerStatuses":[{"name":"webserver","state":{"running":{"startedAt":"2024-05-09T07:29:12Z"}},"lastState":{},"ready":true,"restartCount":0,"image":"registry.k8s.io/e2e-test-images/agnhost:2.52","imageID":"registry.k8s.io/e2e-test-images/agnhost@sha256:b173c7d0ffe3d805d49f4dfe48375169b7b8d2e1feb81783efd61eb9d08042e6","containerID":"containerd://d6699d97a4ae512793c47cce095eb1ca4aa55347fb80c307e72c4a3737139b17","started":true}],"qosClass":"BestEffort"}

No trace of that PodIP in the dumps:

Dump network information for node bootstrap-e2e-minion-group-1qj3:
IP routes: 
default via 10.40.0.1 dev ens4 proto dhcp src 10.40.0.6 metric 100 
10.40.0.1 dev ens4 proto dhcp scope link src 10.40.0.6 metric 100 
10.64.4.2 dev veth42ce9d10 scope host 
10.64.4.4 dev veth71daa408 scope host 
10.64.4.5 dev vetha95098a8 scope host 
10.64.4.6 dev veth51f86df3 scope host 
10.64.4.8 dev vetha36f3657 scope host 
10.64.4.122 dev veth8e444465 scope host 
10.64.4.165 dev veth40ba58c4 scope host 
10.64.4.168 dev vethb86adbd5 scope host 
10.64.4.183 dev veth5ab0d922 scope host 
10.64.4.197 dev veth057fe915 scope host 
10.64.4.200 dev veth4149d025 scope host 
10.64.4.202 dev vethd758b365 scope host 
10.64.4.203 dev vethd40de13a scope host 
10.64.4.204 dev vethc995ab95 scope host 
169.254.169.254 via 10.40.0.1 dev ens4 proto dhcp src 10.40.0.6 metric 100 
IP addresses:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 169.254.169.252/32 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:28:00:06 brd ff:ff:ff:ff:ff:ff
    altname enp0s4
    inet 10.40.0.6/32 metric 100 scope global dynamic ens4
       valid_lft 85372sec preferred_lft 85372sec
    inet6 fe80::4001:aff:fe28:6/64 scope link 
       valid_lft forever preferred_lft forever
3: veth42ce9d10@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether 46:e8:51:12:0a:ea brd ff:ff:ff:ff:ff:ff link-netns cni-c792ec43-de22-c107-5def-99f2cec1894c
    inet 10.64.4.1/32 scope global veth42ce9d10
       valid_lft forever preferred_lft forever
    inet6 fe80::44e8:51ff:fe12:aea/64 scope link 
       valid_lft forever preferred_lft forever
5: veth71daa408@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether ae:23:95:89:e4:65 brd ff:ff:ff:ff:ff:ff link-netns cni-f569d054-bb99-fd3e-59a8-f165fbd890e2
    inet 10.64.4.1/32 scope global veth71daa408
       valid_lft forever preferred_lft forever
    inet6 fe80::ac23:95ff:fe89:e465/64 scope link 
       valid_lft forever preferred_lft forever
6: vetha95098a8@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether 6e:d6:96:03:1d:3b brd ff:ff:ff:ff:ff:ff link-netns cni-69fd0696-6c4a-31b0-0d2e-903514cbc61d
    inet 10.64.4.1/32 scope global vetha95098a8
       valid_lft forever preferred_lft forever
    inet6 fe80::6cd6:96ff:fe03:1d3b/64 scope link 
       valid_lft forever preferred_lft forever
7: veth51f86df3@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether 2e:22:5b:a8:5a:9c brd ff:ff:ff:ff:ff:ff link-netns cni-3335d51a-f4ff-6116-ab6c-9b68308a3ce2
    inet 10.64.4.1/32 scope global veth51f86df3
       valid_lft forever preferred_lft forever
    inet6 fe80::447a:40ff:fe7c:37e4/64 scope link 
       valid_lft forever preferred_lft forever
9: vetha36f3657@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether 8a:15:bc:d4:b9:b4 brd ff:ff:ff:ff:ff:ff link-netns cni-35c2cfbf-ecd8-9cbc-180b-cf166c7dc324
    inet 10.64.4.1/32 scope global vetha36f3657
       valid_lft forever preferred_lft forever
    inet6 fe80::8815:bcff:fed4:b9b4/64 scope link 
       valid_lft forever preferred_lft forever
370: veth8e444465@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether ee:af:ae:f0:32:70 brd ff:ff:ff:ff:ff:ff link-netns cni-716330ba-98b1-28f3-6c04-434dd26b118f
    inet 10.64.4.1/32 scope global veth8e444465
       valid_lft forever preferred_lft forever
    inet6 fe80::ecaf:aeff:fef0:3270/64 scope link 
       valid_lft forever preferred_lft forever
413: veth40ba58c4@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether fe:3d:32:5b:4c:c9 brd ff:ff:ff:ff:ff:ff link-netns cni-66c4a195-626e-9e66-ae17-6d4ca74a684c
    inet 10.64.4.1/32 scope global veth40ba58c4
       valid_lft forever preferred_lft forever
    inet6 fe80::fc3d:32ff:fe5b:4cc9/64 scope link 
       valid_lft forever preferred_lft forever
416: vethb86adbd5@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether f6:2e:9e:db:3e:6f brd ff:ff:ff:ff:ff:ff link-netns cni-78930ab1-4b93-3551-ebe5-2cc8d39b194d
    inet 10.64.4.1/32 scope global vethb86adbd5
       valid_lft forever preferred_lft forever
    inet6 fe80::f42e:9eff:fedb:3e6f/64 scope link 
       valid_lft forever preferred_lft forever
431: veth5ab0d922@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether 1a:a6:2b:ef:ed:0d brd ff:ff:ff:ff:ff:ff link-netns cni-e5996763-5da5-5849-f166-0dce57e17fcb
    inet 10.64.4.1/32 scope global veth5ab0d922
       valid_lft forever preferred_lft forever
    inet6 fe80::18a6:2bff:feef:ed0d/64 scope link 
       valid_lft forever preferred_lft forever
445: veth057fe915@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether d6:fc:b9:0d:17:7d brd ff:ff:ff:ff:ff:ff link-netns cni-888764df-cf56-1dcf-9a0e-cc4057b634fa
    inet 10.64.4.1/32 scope global veth057fe915
       valid_lft forever preferred_lft forever
    inet6 fe80::d4fc:b9ff:fe0d:177d/64 scope link 
       valid_lft forever preferred_lft forever
448: veth4149d025@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether ca:75:cd:9b:53:bc brd ff:ff:ff:ff:ff:ff link-netns cni-b08014d4-f94e-5008-f7d2-21fd6753ca32
    inet 10.64.4.1/32 scope global veth4149d025
       valid_lft forever preferred_lft forever
    inet6 fe80::60a7:37ff:fe6a:8ab3/64 scope link 
       valid_lft forever preferred_lft forever
450: vethd758b365@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether e6:75:2e:44:04:78 brd ff:ff:ff:ff:ff:ff link-netns cni-b2a6ffab-09cc-35ae-1e84-f0d67ddcc40c
    inet 10.64.4.1/32 scope global vethd758b365
       valid_lft forever preferred_lft forever
    inet6 fe80::b837:b3ff:fe76:97/64 scope link 
       valid_lft forever preferred_lft forever
451: vethd40de13a@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether f2:26:8b:d3:77:76 brd ff:ff:ff:ff:ff:ff link-netns cni-67406e7a-dcbd-26a1-84e8-3785fe416ccc
    inet 10.64.4.1/32 scope global vethd40de13a
       valid_lft forever preferred_lft forever
    inet6 fe80::f026:8bff:fed3:7776/64 scope link 
       valid_lft forever preferred_lft forever
452: vethc995ab95@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default 
    link/ether ce:b7:41:49:61:04 brd ff:ff:ff:ff:ff:ff link-netns cni-9cacb7a6-8399-2a94-c2e6-a240506bfb87
    inet 10.64.4.1/32 scope global vethc995ab95
       valid_lft forever preferred_lft forever
    inet6 fe80::ccb7:41ff:fe49:6104/64 scope link tentative 
       valid_lft forever preferred_lft forever
Open sockets: 
State     Recv-Q Send-Q      Local Address:Port        Peer Address:Port Process                                   
LISTEN    0      4096        127.0.0.53%lo:53               0.0.0.0:*     users:(("systemd-resolve",pid=447,fd=14))
LISTEN    0      128               0.0.0.0:22               0.0.0.0:*     users:(("sshd",pid=885,fd=3))            
LISTEN    0      4096            127.0.0.1:16061            0.0.0.0:*     users:(("monitor",pid=9091,fd=6))        
LISTEN    0      4096            127.0.0.1:989              0.0.0.0:*     users:(("proxy",pid=9048,fd=3))          
LISTEN    0      4096            127.0.0.1:40293            0.0.0.0:*     users:(("containerd",pid=8687,fd=11))    
LISTEN    0      4096            127.0.0.1:10248            0.0.0.0:*     users:(("kubelet",pid=8868,fd=21))       
LISTEN    0      4096            127.0.0.1:10249            0.0.0.0:*     users:(("kube-proxy",pid=8962,fd=15))    
LISTEN    0      128               0.0.0.0:48301            0.0.0.0:*     users:(("rpc.statd",pid=81308,fd=8))     
LISTEN    0      128               0.0.0.0:111              0.0.0.0:*     users:(("rpcbind",pid=81303,fd=8))       
TIME-WAIT 0      0               127.0.0.1:35396          127.0.0.1:40293                                          
TIME-WAIT 0      0               10.64.4.1:58010        10.64.4.158:8083                                           
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35404                                          
ESTAB     0      0               10.40.0.6:36306    169.254.169.254:80    users:(("google_guest_ag",pid=751,fd=14))
TIME-WAIT 0      0               10.40.0.6:34158      74.125.197.82:443                                            
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35372                                          
TIME-WAIT 0      0               127.0.0.1:35340          127.0.0.1:40293                                          
TIME-WAIT 0      0               127.0.0.1:35376          127.0.0.1:40293                                          
ESTAB     0      0               10.40.0.6:42184     216.239.32.174:443   users:(("google_osconfig",pid=580,fd=8)) 
TIME-WAIT 0      0               10.40.0.6:34160      74.125.197.82:443                                            
ESTAB     0      0               127.0.0.1:38382          127.0.0.1:989   users:(("monitor",pid=9091,fd=8))        
TIME-WAIT 0      0               127.0.0.1:35334          127.0.0.1:40293                                          
TIME-WAIT 0      0               127.0.0.1:35392          127.0.0.1:40293                                          
ESTAB     0      0               127.0.0.1:40293          127.0.0.1:35430 users:(("containerd",pid=8687,fd=95))    
TIME-WAIT 0      0               10.40.0.6:53358          10.40.0.6:8083                                           
TIME-WAIT 0      0               10.40.0.6:53348          10.40.0.6:8083                                           
ESTAB     0      0               127.0.0.1:40293          127.0.0.1:35436 users:(("containerd",pid=8687,fd=124))   
TIME-WAIT 0      0               127.0.0.1:35342          127.0.0.1:40293                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35424                                          
TIME-WAIT 0      0               10.64.4.1:34066          10.64.4.4:8093                                           
ESTAB     0      0               10.40.0.6:47966     173.194.203.95:443   users:(("google_guest_ag",pid=751,fd=8)) 
TIME-WAIT 0      0               10.64.4.1:38942        10.64.4.185:80                                             
TIME-WAIT 0      0               10.40.0.6:53354          10.40.0.6:8083                                           
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35412                                          
TIME-WAIT 0      0               127.0.0.1:35330          127.0.0.1:40293                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35390                                          
ESTAB     0      0               10.40.0.6:36026     34.168.108.125:443   users:(("kube-proxy",pid=8962,fd=3))     
TIME-WAIT 0      0               10.40.0.6:34442      34.96.108.209:443                                            
TIME-WAIT 0      0               127.0.0.1:35378          127.0.0.1:40293                                          
TIME-WAIT 0      0               127.0.0.1:35288          127.0.0.1:40293                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35422                                          
TIME-WAIT 0      0               10.64.4.1:51616          10.64.4.2:8181                                           
ESTAB     0      0               10.40.0.6:36020     34.168.108.125:443   users:(("kubelet",pid=8868,fd=14))       
TIME-WAIT 0      0               127.0.0.1:35322          127.0.0.1:40293                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35380                                          
TIME-WAIT 0      0               127.0.0.1:35346          127.0.0.1:40293                                          
TIME-WAIT 0      0               10.64.4.1:34074          10.64.4.4:8093                                           
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35386                                          
TIME-WAIT 0      0               127.0.0.1:34108          127.0.0.1:10248                                          
TIME-WAIT 0      0               10.40.0.6:53350          10.40.0.6:8083                                           
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35416                                          
TIME-WAIT 0      0               10.64.4.1:38732        10.64.4.177:9898                                           
ESTAB     0      0               127.0.0.1:35430          127.0.0.1:40293 users:(("kubelet",pid=8868,fd=29))       
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35432                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35398                                          
TIME-WAIT 0      0               10.40.0.6:34440      34.96.108.209:443                                            
TIME-WAIT 0      0               127.0.0.1:35290          127.0.0.1:40293                                          
TIME-WAIT 0      0               127.0.0.1:35394          127.0.0.1:40293                                          
TIME-WAIT 0      0               127.0.0.1:34104          127.0.0.1:10248                                          
TIME-WAIT 0      0               127.0.0.1:34112          127.0.0.1:10248                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35408                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35410                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35384                                          
TIME-WAIT 0      0               10.64.4.1:38944        10.64.4.185:80                                             
TIME-WAIT 0      0               10.64.4.1:34062          10.64.4.4:8093                                           
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35400                                          
ESTAB     0      0               127.0.0.1:35436          127.0.0.1:40293 users:(("kubelet",pid=8868,fd=31))       
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35414                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35388                                          
TIME-WAIT 0      0               127.0.0.1:34106          127.0.0.1:10248                                          
TIME-WAIT 0      0               10.64.4.1:56504        10.64.4.198:8444                                           
ESTAB     0      0               10.40.0.6:36292    169.254.169.254:80    users:(("google_osconfig",pid=580,fd=3)) 
TIME-WAIT 0      0               10.40.0.6:53352          10.40.0.6:8083                                           
TIME-WAIT 0      0               10.64.4.1:38842          10.64.4.2:8080                                           
TIME-WAIT 0      0               10.64.4.1:51622          10.64.4.2:8181                                           
TIME-WAIT 0      0               10.40.0.6:53366          10.40.0.6:8083                                           
TIME-WAIT 0      0               10.64.4.1:38836          10.64.4.2:8080                                           
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35428                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35382                                          
TIME-WAIT 0      0               127.0.0.1:35344          127.0.0.1:40293                                          
TIME-WAIT 0      0               127.0.0.1:35326          127.0.0.1:40293                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35402                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35374                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35406                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35426                                          
TIME-WAIT 0      0               10.40.0.6:34438      34.96.108.209:443                                            
TIME-WAIT 0      0               10.40.0.6:34152      74.125.197.82:443                                            
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35420                                          
TIME-WAIT 0      0               10.40.0.6:53356          10.40.0.6:8083                                           
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35418                                          
TIME-WAIT 0      0               10.40.0.6:53368          10.40.0.6:8083                                           
ESTAB     0      0               127.0.0.1:989            127.0.0.1:38382 users:(("proxy",pid=9048,fd=6))          
TIME-WAIT 0      0               127.0.0.1:34102          127.0.0.1:10248                                          
TIME-WAIT 0      0               127.0.0.1:34114          127.0.0.1:10248                                          
TIME-WAIT 0      0               127.0.0.1:40293          127.0.0.1:35434                                          
LISTEN    0      128                  [::]:54993               [::]:*     users:(("rpc.statd",pid=81308,fd=10))    
LISTEN    0      4096                    *:8083                   *:*     users:(("agnhost",pid=127642,fd=3))      
LISTEN    0      128                  [::]:22                  [::]:*     users:(("sshd",pid=885,fd=4))            
LISTEN    0      4096                    *:988                    *:*     users:(("proxy",pid=9048,fd=5))          
LISTEN    0      4096                    *:10250                  *:*     users:(("kubelet",pid=8868,fd=23))       
LISTEN    0      4096                    *:6061                   *:*     users:(("monitor",pid=9091,fd=5))        
LISTEN    0      128                  [::]:111                 [::]:*     users:(("rpcbind",pid=81303,fd=11))      
LISTEN    0      4096                    *:10255                  *:*     users:(("kubelet",pid=8868,fd=24))       
LISTEN    0      4096                    *:10256                  *:*     users:(("kube-proxy",pid=8962,fd=10))    
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.1.2]:54366                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:8083  [::ffff:10.40.0.6]:53366                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.2.3]:51636                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.3.2]:59512                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.4.4]:58092                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.4.4]:58104                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.3.2]:59526                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.1.2]:54356                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.2.3]:51644                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.2.3]:51642                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:8083  [::ffff:10.40.0.6]:53370                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.4.4]:58108                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:8083  [::ffff:10.40.0.6]:53360                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.3.2]:59520                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.4.4]:58102                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.2.3]:51638                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.3.2]:59510                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:8083  [::ffff:10.40.0.6]:53376                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.4.4]:58110                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.3.2]:59522                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:8083  [::ffff:10.40.0.6]:53362                                          
ESTAB     0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.3.4]:35272 users:(("kubelet",pid=8868,fd=3))        
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.3.2]:59528                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.3.2]:59524                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.1.2]:54364                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.4.4]:58098                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:8083  [::ffff:10.40.0.6]:53372                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.2.3]:51632                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.1.2]:54358                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:8083  [::ffff:10.40.0.6]:53364                                          
ESTAB     0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.2.3]:51648 users:(("kubelet",pid=8868,fd=30))       
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.3.2]:59518                                          
ESTAB     0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.1.2]:54070 users:(("kubelet",pid=8868,fd=19))       
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.3.2]:59514                                          
ESTAB     0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.2.3]:51646 users:(("kubelet",pid=8868,fd=26))       
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.1.2]:54360                                          
TIME-WAIT 0      0      [::ffff:10.40.0.6]:10250 [::ffff:10.64.1.2]:54362                                          

We'll have to dump the iptables rules too.
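For the record, a minimal pair of commands that could be run on the affected node to capture that state (a sketch: it assumes kube-proxy is in iptables mode, that conntrack-tools is installed on the node, and the namespace/service/port values are placeholders to fill in from the failing run):

sudo iptables-save | grep "<test-namespace>/<service-name>"   # service/endpoint chains programmed for the test service
sudo conntrack -L -p tcp | grep <endpoint-port>               # TCP conntrack entries towards the endpoints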

@aojea
Copy link
Member

aojea commented May 25, 2024

I don't see many failures lately 🤔

@uablrek
Copy link
Contributor

uablrek commented May 25, 2024

A heisenbug?

@aroradaman
Copy link
Member

I don't see many failures lately 🤔

Yes, this is a case where I'm desperately waiting for a test to fail/flake :p (to get more debug logs).

@aroradaman
Copy link
Member

@aojea https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-e2e-ubuntu-gce/1794418456712450048 failed today, but the iptables and conntrack flows were not dumped 😔

// Dump the node iptables rules and conntrack flows for troubleshooting #123760
podList, _ := config.f.ClientSet.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{
	LabelSelector: "k8s-app=kube-proxy",
})
for _, pod := range podList.Items {
	// dump only for the node running test-container-pod
	if pod.Status.HostIP == config.TestContainerPod.Status.HostIP {
		output, _, _ := e2epod.ExecWithOptions(config.f, e2epod.ExecOptions{
			Namespace:          "kube-system",
			PodName:            pod.Name,
			ContainerName:      "kube-proxy",
			Command:            []string{"sh", "-c", fmt.Sprintf(`echo "IPTables Dump: " && iptables-save | grep "%s/%s:http" && echo "Conntrack flows: " && conntrack -Ln -p tcp | grep %d`, config.Namespace, config.NodePortService.Name, EndpointHTTPPort)},
			Stdin:              nil,
			CaptureStdout:      true,
			CaptureStderr:      true,
			PreserveWhitespace: false,
		})
		framework.Logf("Dump iptables and connntrack flows\n%s", output)
		break
	}
}
return returnMsg
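One guess as to why nothing came out (not confirmed from the logs): the sh -c string chains every step with &&, so if the first grep matches no rules it exits non-zero and the conntrack part never runs, while the exec error itself is discarded by the blank identifiers. A more forgiving command string, sketched here with placeholder values, runs every step regardless of grep's exit status:

sh -c 'echo "IPTables Dump:"; iptables-save | grep "<ns>/<svc>:http" || true; echo "Conntrack flows:"; conntrack -Ln -p tcp | grep <port> || true'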

@wendy-ha18
Copy link
Member

@aojea this test failed again recently, on 30-05-2024: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1796412325301850112

It looks like a failure dialing a specific HTTP endpoint and not finding the expected responses. Could you have a look at it?

{ failed [FAILED] failed dialing endpoint (recovery), did not find expected responses... 
Tries 46
Command curl -g -q -s 'http://10.64.4.65:9080/dial?request=hostname&protocol=http&host=10.0.79.230&port=80&tries=1'
retrieved map[netserver-1:{} netserver-2:{} netserver-3:{}]
expected map[netserver-1:{} netserver-2:{} netserver-3:{}]
In [It] at: k8s.io/kubernetes/test/e2e/network/networking.go:341 @ 05/31/24 05:39:28.731
}
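For context, that check just execs a curl against the agnhost test pod's /dial endpoint, so it can be reproduced by hand while the e2e namespace still exists (a sketch: the namespace and pod name are placeholders from the run, while the container name webserver and the IPs are the ones in the log above):

kubectl exec -n <test-namespace> <test-container-pod> -c webserver -- \
  curl -g -q -s 'http://10.64.4.65:9080/dial?request=hostname&protocol=http&host=10.0.79.230&port=80&tries=1'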

@aojea
Copy link
Member

aojea commented May 31, 2024

The error is misleading; if we scroll up in the logs, we see the failure is because it cannot find the containerd socket.

s+%27http%3A%2F%2F10.64.4.65%3A9080%2Fdial%3Frequest%3Dhostname%26protocol%3Dhttp%26host%3D10.0.79.230%26port%3D80%26tries%3D1%27&container=webserver&container=webserver&stderr=true&stdout=true)
I0531 05:39:28.731594 10367 utils.go:320] GetResponseFromContainer: failed to execute "curl -g -q -s 'http://10.64.4.65:9080/dial?request=hostname&protocol=http&host=10.0.79.230&port=80&tries=1'": unable to upgrade connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory", stdout: "", stderr: ""
I0531 05:39:28.731652 10367 utils.go:383] encountered error during dial (did not find expected responses... 
Tries 46
Command curl -g -q -s 'http://10.64.4.65:9080/dial?request=hostname&protocol=http&host=10.0.79.230&port=80&tries=1'
retrieved map[netserver-1:{} netserver-2:{} netserver-3:{}]
expected map[netserver-1:{} netserver-2:{} netserver-3:{}])
[FAILED] failed dialing endpoint (recovery), did not find expected responses... 
Tries 46
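If it reproduces again, a quick way to confirm that theory directly on the affected node would be something like the following (a sketch, assuming the default socket path and a systemd-managed containerd):

ls -l /run/containerd/containerd.sock
systemctl status containerd
journalctl -u containerd --since "1 hour ago" | tail -n 100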

@wendy-ha18
Copy link
Member

I think the containerd socket error in the log may refer to a different issue, #125228, which also failed at the same time today, @aojea. Or they may be related to each other.

If you take a look at the testgrid here: https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd, there are two failing tests, and one of them is this one: Kubernetes e2e suite.[It] [sig-network] Networking Granular Checks: Services should update endpoints: http

@Vyom-Yadav
Copy link
Member

@wendy-ha18 Both tests are failing for the same reason. Testgrid just shows a snippet of the stderr, so it can be misleading sometimes :)

@aojea
Copy link
Member

aojea commented Jul 18, 2024

How is this doing?

I can see this test has been green for the last week: https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd&include-filter-by-regex=Granular%20Checks

and no flakes https://storage.googleapis.com/k8s-triage/index.html?pr=1&job=master&test=Services%20should%20update%20endpoints

I still think this was some issue with containerd, as this job builds containerd from the master branch.

@aroradaman can we remove all the debugging things we added?

/close

we can reopen if it happens again

@k8s-ci-robot
Copy link
Contributor

@aojea: Closing this issue.

In response to this:

How is this doing?

I can see this test has been green for the last week: https://testgrid.k8s.io/sig-release-master-blocking#gce-ubuntu-master-containerd&include-filter-by-regex=Granular%20Checks

and no flakes https://storage.googleapis.com/k8s-triage/index.html?pr=1&job=master&test=Services%20should%20update%20endpoints

I still think this was some issue with containerd, as this job builds containerd from the master branch.

@aroradaman can we remove all the debugging things we added?

/close

we can reopen if it happens again

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@aroradaman
Copy link
Member

@aojea sure.
