
[Flaky Test] Networking Granular Checks: Services should function for pod-Service(hostNetwork): udp #100889

Closed
thejoycekung opened this issue Apr 7, 2021 · 15 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@thejoycekung
Contributor

Which jobs are flaking:

ci-kubernetes-e2e-gce-cos-k8sbeta-default

Which test(s) are flaking:

Networking Granular Checks: Services should function for pod-Service(hostNetwork): udp

Testgrid link:

https://testgrid.k8s.io/sig-release-1.21-blocking#gce-cos-k8sbeta-default

Reason for failure:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/networking.go:471
Apr  7 11:07:22.238: Unexpected error:
    <*errors.errorString | 0xc00033e240>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/network/utils.go:858

Anything else we need to know:

Looks like this test got less flaky over the course of this week:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=Networking%20Granular%20Checks:%20Services%20should%20function%20for%20pod-Service#8c567bd35a2df3c096d2

But we've noticed several flakes for it on this job over the past couple of days:

@thejoycekung thejoycekung added the kind/flake Categorizes issue or PR as related to a flaky test. label Apr 7, 2021
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Apr 7, 2021
@k8s-ci-robot
Contributor

@thejoycekung: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 7, 2021
@thejoycekung
Contributor Author

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 7, 2021
@palnabarun
Member

/milestone v1.21

(marking in the current milestone until we confirm whether it is relevant to the release or not)

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Apr 7, 2021
@palnabarun
Member

@prameshj
Contributor

prameshj commented Apr 7, 2021

This is failing because the client pod is unable to be scheduled. From the test logs:

Mar 31 18:19:57.135: INFO: The status of Pod netserver-0 is Pending, waiting for it to be Running (with Ready = true)
Mar 31 18:19:58.927: INFO: The status of Pod netserver-0 is Pending, waiting for it to be Running (with Ready = true)
Mar 31 18:19:59.176: INFO: The status of Pod netserver-0 is Pending, waiting for it to be Running (with Ready = true)
Mar 31 18:19:59.177: FAIL: Unexpected error:
    <*errors.errorString | 0xc0001cc250>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

Full Stack Trace
k8s.io/kubernetes/test/e2e/framework/network.(*NetworkingTestConfig).createNetProxyPods(0xc000caa620, 0x6bc73c2, 0x9, 0xc002bd0d80, 0x0, 0x99b3fa0, 0x0)

Scheduler logs:

I0331 18:14:57.379017      10 eventhandlers.go:164] "Add event for unscheduled pod" pod="nettest-8284/netserver-0"
I0331 18:14:57.379333      10 scheduling_queue.go:849] "About to try and schedule pod" pod="nettest-8284/netserver-0"
I0331 18:14:57.379382      10 scheduler.go:459] "Attempting to schedule pod" pod="nettest-8284/netserver-0"
I0331 18:14:57.379730      10 factory.go:338] "Unable to schedule pod; no fit; waiting" pod="nettest-8284/netserver-0" err="0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 1 node(s) were unschedulable, 2 node(s) didn't match Pod's node affinity/selector."
I0331 18:14:57.379828      10 scheduler.go:357] "Updating pod condition" pod="nettest-8284/netserver-0" conditionType=PodScheduled conditionStatus=False conditionReason="Unschedulable"
I0331 18:14:58.210889      10 eventhandlers.go:164] "Add event for unscheduled pod" pod="nettest-8284/netserver-1"
I0331 18:14:58.211118      10 scheduling_queue.go:849] "About to try and schedule pod" pod="nettest-8284/netserver-1"
I0331 18:14:58.211163      10 scheduler.go:459] "Attempting to schedule pod" pod="nettest-8284/netserver-1"
I0331 18:14:58.212741      10 factory.go:338] "Unable to schedule pod; no fit; waiting" pod="nettest-8284/netserver-1" err="0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 1 node(s) were unschedulable, 2 node(s) didn't match Pod's node affinity/selector."
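The "didn't have free ports" message above is the scheduler's node-ports check failing for a hostNetwork pod: when `hostNetwork: true` is set, each declared `containerPort` effectively claims that port on the node itself, so two such pods requesting the same port can never land on the same node. A hypothetical minimal manifest illustrating the shape of such a pod (the name, image, and port number are illustrative, not the actual test podspec):

```yaml
# Illustrative only -- not the manifest the e2e framework generates.
apiVersion: v1
kind: Pod
metadata:
  name: netserver-example        # hypothetical name
spec:
  hostNetwork: true              # pod shares the node's network namespace
  containers:
  - name: webserver
    image: registry.k8s.io/e2e-test-images/agnhost:2.33   # assumed image
    ports:
    - containerPort: 8083        # with hostNetwork, this claims node port 8083
      protocol: UDP
```

If a stale pod from a previous run (or anything else) already holds that port on a node, the scheduler reports "node(s) didn't have free ports for the requested pod ports", exactly as in the log above.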

The podspec is here -

func (config *NetworkingTestConfig) createNetProxyPods(podName string, selector map[string]string) []*v1.Pod {

Most likely not release-blocking; I'll defer to someone more familiar with the test.
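The port-fit decision behind those log lines can be sketched in a few lines of Go. This is a loose, illustrative model of the scheduler's node-ports filter, not the real plugin code; all names here are made up:

```go
package main

import "fmt"

// fits loosely mimics the scheduler's node-ports filter: a pod that
// requests host ports (which every containerPort of a hostNetwork pod
// effectively is) only fits on a node where none of those ports are
// already claimed. Illustrative sketch, not the real scheduler API.
func fits(used map[int32]bool, requested []int32) bool {
	for _, p := range requested {
		if used[p] {
			return false
		}
	}
	return true
}

func main() {
	// Suppose a leftover netserver pod from a previous run still holds 8083.
	used := map[int32]bool{8081: true, 8083: true}

	fmt.Println(fits(used, []int32{8080})) // true: port is free
	fmt.Println(fits(used, []int32{8083})) // false: "didn't have free ports"
}
```

A node only passes this filter if every requested port is free, which is why a single stale pod holding one port is enough to make netserver-0 unschedulable on that node.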

@bowei
Member

bowei commented Apr 7, 2021

Did the cluster footprint change for the e2e test framework?

@prameshj
Contributor

prameshj commented Apr 7, 2021

This is the cluster info from a passing test on master: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce/1379895509761658880
That testgrid is green: https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default

This is from a failed run.

Same number of nodes and compute disks.

@aojea
Member

aojea commented Apr 7, 2021

This is fixed by #100893; let's wait for a few runs to confirm before closing.

@prameshj
Contributor

prameshj commented Apr 7, 2021

Thanks Antonio! Any idea why they are passing on master, even before the fix? https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default

@aojea
Member

aojea commented Apr 7, 2021

No clear idea; the jobs are not identical... I can't understand why it only happens in that job.

@palnabarun
Member

/milestone clear

(since this has been deemed to be non release-blocking)

@k8s-ci-robot k8s-ci-robot removed this from the v1.21 milestone Apr 8, 2021
@jayunit100
Member

Flakes significantly dropped off after 04/09, so it's looking better, I think? I guess we can wait another day... https://testgrid.k8s.io/sig-release-1.21-blocking#gce-cos-k8sbeta-default

@aojea
Member

aojea commented Apr 11, 2021

Flakes significantly dropped off after 04/09, so it's looking better, I think? I guess we can wait another day... https://testgrid.k8s.io/sig-release-1.21-blocking#gce-cos-k8sbeta-default

that is the weird part, the cherry-pick didn't merge yet #100908

@aojea
Member

aojea commented Jun 2, 2021

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

/close
seems the cherry pick made it
https://testgrid.k8s.io/sig-release-1.21-blocking#gce-cos-k8sbeta-default&width=5&include-filter-by-regex=hostNetwork

