
session affinity tests flaking, possibly broken due to gradual e2e-gce slowdown? #83617

danwinship (Contributor) opened this issue on Oct 8, 2019

Which jobs are failing:
pull-kubernetes-e2e-gce

Which test(s) are failing:

[sig-network] Services should have session affinity work for NodePort service [LinuxOnly]
[sig-network] Services should be able to switch session affinity for service with type clusterIP [LinuxOnly]
[sig-network] Services should be able to switch session affinity for NodePort service [LinuxOnly]
[sig-network] Services should have session affinity work for service with type clusterIP [LinuxOnly]

Since when has it been failing:
At least since 9/30 (https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1178921594651676674). But that's almost at the end of what's currently visible on the testgrid, so it may have been failing for longer.

Testgrid link:
Not sure what I'm supposed to put here. https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce ?

Reason for failure:
The test case connects to the service every 2 seconds for 1 minute and checks whether it hit the same endpoint at least 15 times in a row. But the individual checks are taking unexpectedly long, so sometimes the test doesn't even manage to run 15 times total within the minute. e.g.:

Oct  8 13:32:01.402: INFO: Running '/go/src/k8s.io/kubernetes/kubernetes/platforms/linux/amd64/kubectl --server=https://34.82.143.14 --kubeconfig=/workspace/.kube/config exec --namespace=services-1060 execpod-affinityklpfn -- /bin/sh -x -c curl -q -s --connect-timeout 2 http://10.40.0.5:31686/'
Oct  8 13:32:07.627: INFO: stderr: "+ curl -q -s --connect-timeout 2 http://10.40.0.5:31686/\n"
Oct  8 13:32:07.627: INFO: stdout: "affinity-nodeport-2z9hv"
Oct  8 13:32:07.627: INFO: Received response from host: affinity-nodeport-2z9hv
Oct  8 13:32:09.402: INFO: Running '/go/src/k8s.io/kubernetes/kubernetes/platforms/linux/amd64/kubectl --server=https://34.82.143.14 --kubeconfig=/workspace/.kube/config exec --namespace=services-1060 execpod-affinityklpfn -- /bin/sh -x -c curl -q -s --connect-timeout 2 http://10.40.0.5:31686/'
Oct  8 13:32:16.027: INFO: stderr: "+ curl -q -s --connect-timeout 2 http://10.40.0.5:31686/\n"
Oct  8 13:32:16.027: INFO: stdout: "affinity-nodeport-2z9hv"
Oct  8 13:32:16.027: INFO: Received response from host: affinity-nodeport-2z9hv

Note the gap between the "Running" line and the "stderr" line. It's not clear whether the slowness is in connecting to the execpod or in connecting from the execpod to the affinity service, and it's not particularly consistent between attempts (here the gaps were 6s, 1s, 2s, 3s, 5s, 7s, 7s, 3s, 4s, 6s, 7s, 8s).
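One rough way to split those two out by hand (not part of the test; the namespace, pod name, and NodePort below are just copied from the log above and will differ per run): curl's own -w timings only cover the connection from the execpod to the service, while the outer `time` also includes the kubectl exec / apiserver round trip.

```bash
# Hand-run sketch to separate kubectl-exec overhead from in-cluster curl time.
# Namespace, pod name, and NodePort are taken from the log excerpt above.
time kubectl --namespace=services-1060 exec execpod-affinityklpfn -- \
  curl -s -o /dev/null --connect-timeout 2 \
    -w 'connect=%{time_connect}s total=%{time_total}s\n' \
    http://10.40.0.5:31686/
# If curl's "total" stays small while the wall-clock time is several seconds,
# the slowness is on the exec/apiserver side rather than in the service path.
```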

Anything else we need to know:
It looks like the checks are also slow in successful runs; they're just not quite as slow, so the test passes.

I'm not sure whether the problem is "something is making the cluster slow, so this perfectly valid test is taking too long and failing" or "e2e-gce has grown over time, and cluster load has reached the point where this formerly-passing test now has unreasonable performance expectations".
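For reference, the check the test performs boils down to roughly the loop below. This is just a hand-runnable sketch reusing the pod/port names from the log above (the real implementation is Go code in the e2e framework); running something like it against a current pull-kubernetes-e2e-gce cluster would show directly whether ~2 seconds per request is still realistic there.

```bash
#!/usr/bin/env bash
# Rough equivalent of the test loop: hit the service every ~2s for 1 minute
# and require 15 identical responses in a row. Names are from the log above.
target=http://10.40.0.5:31686/
streak=0; last=""
end=$((SECONDS + 60))
while [ "$SECONDS" -lt "$end" ]; do
  before=$SECONDS
  host=$(kubectl --namespace=services-1060 exec execpod-affinityklpfn -- \
           curl -q -s --connect-timeout 2 "$target")
  echo "got '$host' after $((SECONDS - before))s"
  if [ "$host" = "$last" ]; then streak=$((streak + 1)); else streak=1; fi
  last=$host
  if [ "$streak" -ge 15 ]; then echo "affinity held for $streak requests"; exit 0; fi
  sleep 2
done
echo "never saw 15 consecutive identical responses in 1 minute"
exit 1
```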

/sig network
