
Flake: [sig-network] Services should have session affinity work for LoadBalancer service with ESIPP #71423

Closed
jberkus opened this Issue Nov 26, 2018 · 27 comments

Comments

jberkus commented Nov 26, 2018

Which jobs are failing: gce-cos-master-slow

Which test(s) are failing: [sig-network] Services should have session affinity work for LoadBalancer service with ESIPP on [Slow] [DisabledForLargeClusters]

Since when has it been failing: This test started to be flaky on 11/18

Testgrid link: https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-cos-master-slow

Reason for failure:

Looks to be a pretty straightforward test failure:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:1648
Nov 26 10:03:00.209: Affinity should hold but didn't.
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/service_util.go:1452

If we look at triage, this test has always been flaky, but it got significantly more flaky (2X failures) on 10/27: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2018-10-29&test=Services%20should%20have%20session%20affinity%20work%20for%20LoadBalancer%20service%20with%20ESIPP

Anything else we need to know:

This job is non-blocking for 1.13; however, if this represents an actual bug in session affinity, the 1.13 team needs to know ASAP.

This failure may or may not be related to #52495. It looks like this test was disabled for large clusters, but maybe the problem wasn't the cluster size.
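
For context on what the test exercises: it creates a LoadBalancer Service with ClientIP session affinity and externalTrafficPolicy: Local (the ESIPP part), then checks that repeated requests from a single client keep landing on the same backend pod. Below is a minimal sketch of an equivalent Service built with the core/v1 Go types; the name, selector, and ports are illustrative placeholders, not values from the e2e framework.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Illustrative Service roughly equivalent to the one the e2e test
	// creates through its framework helpers (names/ports are placeholders).
	svc := &v1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "affinity-lb-esipp"},
		Spec: v1.ServiceSpec{
			Type: v1.ServiceTypeLoadBalancer,
			// ESIPP: route only to endpoints on the node that received
			// the traffic, preserving the client source IP.
			ExternalTrafficPolicy: v1.ServiceExternalTrafficPolicyTypeLocal,
			// ClientIP affinity: repeated requests from the same client
			// should keep hitting the same backend pod.
			SessionAffinity: v1.ServiceAffinityClientIP,
			Selector:        map[string]string{"app": "affinity-backend"},
			Ports: []v1.ServicePort{{
				Port:       80,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}
	fmt.Printf("%+v\n", svc.Spec)
}
```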

/kind flake
/sig network
/priority important-soon
attn:
@thockin @caseydavenport @dcbw


jberkus (Author) commented Nov 27, 2018

Sure. But the failures have become 3X as frequent, so I still want a sign-off by sig-network.

MrHohn (Member) commented Nov 27, 2018

grayluck (Contributor) commented Nov 27, 2018

Ack. Will work on this.

AishSundar (Contributor) commented Nov 28, 2018

@grayluck we plan to lift code freeze for 1.13 tomorrow, 11/28, at 5pm PST. If there is a new regression in 1.13 that has increased the flakes and needs addressing, we need to get the fix in ASAP. Can you please let us know whether this is a blocker for 1.13 by tomorrow morning, 11/28, 10am PST?

MrHohn (Member) commented Nov 28, 2018

If we look at triage, this test has always been flaky, but it got significantly more flaky (2X failures) on 10/27:

Great observation. I checked the commits around that date but wasn't able to find any change that might have caused a regression. I haven't been able to reproduce the same failure locally either.


MrHohn (Member) commented Nov 29, 2018

Good catch on that. The recent increase in flakiness is likely not caused by 1.13 changes, so we shouldn't block the release on this either.
cc @bowei

MrHohn (Member) commented Nov 29, 2018

But we will be looking deeper into the root cause.

anfernee (Member) commented Dec 7, 2018

Several observations:

  • It only happens when isTransitionTest is false. When it's true, the test does more work before actually checking service session affinity. https://github.com/kubernetes/kubernetes/blob/5e30299ad17686/test/e2e/network/service.go#L2088
  • It doesn't happen when executing from a pod, probably for the same reason: in that case the test starts the test pod before actually testing the service.
  • I wrote a script that does the same thing against my 1.11.3-gke.18 cluster in us-central1-a and cannot reproduce it.

The hypothesis is that the service session affinity spec takes some time to reconcile after the service is created. That could happen either on the node or on the GCLB; since I haven't been able to reproduce it, it's hard to tell, but it's more likely on the GCLB side, and it could even be a regional issue. It looks like a sleep of a few seconds would deflake the test, but that would just hide the problem. I would rather open another issue to track the real issue.
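
One way to narrow the node-vs-GCLB question down would be to read back the GCE target pool that the cloud provider created for the Service and confirm what session affinity it was actually programmed with. A rough debugging sketch is below; it was not run as part of this thread, the project/region/target-pool names are placeholders, and it assumes application default credentials are available.

```go
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	// Uses application default credentials.
	gce, err := compute.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder project/region/target-pool names; the real target pool
	// name is derived from the Service by the GCE cloud provider.
	tp, err := gce.TargetPools.Get("my-project", "us-central1", "a1b2c3d4e5f6").Do()
	if err != nil {
		log.Fatal(err)
	}
	// For a Service with sessionAffinity: ClientIP we'd expect CLIENT_IP
	// here; NONE would suggest the affinity setting was never programmed
	// on the load balancer at all.
	fmt.Println("target pool session affinity:", tp.SessionAffinity)
}
```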

anfernee added a commit to anfernee/kubernetes that referenced this issue Dec 13, 2018

Logs [pod,node] pairs for sessionAffinity test
Logs information for fixing flakiness for kubernetes#71423

anfernee added a commit to anfernee/kubernetes that referenced this issue Dec 14, 2018

Logs [pod,node] pairs for sessionAffinity test
Logs information for fixing flakiness for kubernetes#71423

anfernee (Member) commented Dec 19, 2018

The most recent failure is here:
https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-slow/22578

Basically, what happened is that the GCLB doesn't honor session affinity and still sends requests to different VMs. The log shows that three requests were routed to three different VMs.

LOGS

Dec 19 07:40:58.094: INFO: [service-fsn65 service-bf8nc]
Dec 19 07:40:58.094: INFO: Affinity should hold but didn't.
Dec 19 07:40:58.196: INFO: [pod,node] pairs: [{Pod:service-bf8nc Node:bootstrap-e2e-minion-group-f3nb} {Pod:service-fsn65 Node:bootstrap-e2e-minion-group-9sp7} {Pod:service-rnrcw Node:bootstrap-e2e-minion-group-m5rj}]; err:
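
For clarity on reading that log: the test makes a series of requests to the load balancer and records which backend pod answered each one, and affinity is only considered to hold if a single pod served every request. The first log line above is the set of distinct pods observed. A rough sketch of that interpretation (hypothetical, assuming a backend that echoes its pod name; not the actual code in service_util.go):

```go
package main

import "fmt"

// affinityHolds reports whether a single backend served every request.
// hosts holds the pod name returned by each request, in order.
func affinityHolds(hosts []string) (bool, []string) {
	seen := map[string]bool{}
	var distinct []string
	for _, h := range hosts {
		if !seen[h] {
			seen[h] = true
			distinct = append(distinct, h)
		}
	}
	return len(distinct) == 1, distinct
}

func main() {
	// Mirrors the failure above: responses came from more than one pod,
	// so the distinct set is [service-fsn65 service-bf8nc] and the check fails.
	hosts := []string{"service-fsn65", "service-bf8nc", "service-fsn65"}
	ok, distinct := affinityHolds(hosts)
	fmt.Println(distinct, "affinity holds:", ok)
}
```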

globervinodhn added a commit to globervinodhn/kubernetes that referenced this issue Jan 10, 2019

Logs [pod,node] pairs for sessionAffinity test
Logs information for fixing flakiness for kubernetes#71423

YoubingLi added a commit to YoubingLi/kubernetes that referenced this issue Jan 23, 2019

Logs [pod,node] pairs for sessionAffinity test
Logs information for fixing flakiness for kubernetes#71423

phenixblue added a commit to phenixblue/kubernetes that referenced this issue Jan 24, 2019

Logs [pod,node] pairs for sessionAffinity test
Logs information for fixing flakiness for kubernetes#71423

alejandrox1 (Contributor) commented Feb 28, 2019

More recent failures: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster/1288

Logs:

I0227 05:59:19.127] [AfterEach] [sig-network] Services
I0227 05:59:19.127] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:90
I0227 05:59:19.127]
I0227 05:59:19.127] • Failure [103.407 seconds]
I0227 05:59:19.127] [sig-network] Services
I0227 05:59:19.128] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/framework.go:22
I0227 05:59:19.128] should have session affinity work for LoadBalancer service with ESIPP off [Slow] [DisabledForLargeClusters] [It]
I0227 05:59:19.128] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:1670
I0227 05:59:19.128]
I0227 05:59:19.128] Feb 27 05:58:24.807: Affinity should hold but didn't.
I0227 05:59:19.128]
I0227 05:59:19.128] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/service_util.go:1452

mariantalla (Contributor) commented Mar 6, 2019

Hello, could this issue be prioritized to help stabilize the release-blocking upgrade tests? They're currently flaking.

bowei (Member) commented Mar 6, 2019

/assign @MrHohn can you take a look

k8s-ci-robot (Contributor) commented Mar 6, 2019

@bowei: GitHub didn't allow me to assign the following users: take, a, look, can, you.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


grayluck (Contributor) commented Mar 6, 2019

I am preparing a PR that basically allows affinity some time to be configured instead of failing instantly.
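
In other words, the direction is to retry the affinity check until it passes or a deadline expires, rather than doing a one-shot check (or a fixed sleep). A rough sketch of that retry-until-timeout idea follows; it is only an illustration, not the actual change, and the intervals, timeout, and checkAffinity stand-in are assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkAffinity stands in for hitting the LB several times and verifying
// that a single backend served every request (see the earlier sketch).
func checkAffinity() bool {
	// Simulate an LB whose affinity programming takes a while to converge.
	return rand.Intn(4) == 0
}

func main() {
	// Keep retrying for up to a few minutes instead of failing on the
	// first attempt; the 10s/3m values here are placeholders.
	err := wait.PollImmediate(10*time.Second, 3*time.Minute, func() (bool, error) {
		return checkAffinity(), nil
	})
	if err != nil {
		fmt.Println("Affinity should hold but didn't.")
		return
	}
	fmt.Println("Affinity holds.")
}
```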

mariantalla (Contributor) commented Mar 8, 2019

Thanks! Will keep an eye out for it.

@mariantalla mariantalla moved this from Flakes to Under investigation (prioritized) in 1.15 CI Signal Mar 8, 2019

1.15 CI Signal automation moved this from Under investigation (prioritized) to Open PR-wait for >5 successes before "Resolved" Mar 9, 2019

@alejandrox1 alejandrox1 moved this from Open PR-wait for >5 successes before "Resolved" to Resolved flakes (observe closed for a week before "Resolved") in 1.15 CI Signal Mar 9, 2019


mariantalla (Contributor) commented Mar 13, 2019

It's more stable now, but it's still flaking once or twice a day, e.g. in upgrade tests.

@grayluck - Is this the same issue as you looked at? Is there more to be done here?

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Mar 13, 2019

k8s-ci-robot (Contributor) commented Mar 13, 2019

@mariantalla: Reopened this issue.


1.15 CI Signal automation moved this from Resolved flakes (observe closed for a week before "Resolved") to Under investigation (prioritized) Mar 13, 2019

MrHohn (Member) commented Mar 13, 2019

@grayluck @mariantalla we should probably cherry-pick #75073 into 1.13 to deflake the upgrade/downgrade jobs, because those jobs run 1.13 tests against a 1.14 (upgraded) or 1.13 (downgraded) cluster.

FYI, we've hit a similar situation with the DNS test (ref #74543 (comment)).

grayluck (Contributor) commented Mar 13, 2019

Good point.
Cherry-pick PR prepared: #75341

@alejandrox1 alejandrox1 moved this from Under investigation (prioritized) to Resolved flakes (observe closed for a week before "Resolved") in 1.15 CI Signal Mar 14, 2019

@mariantalla mariantalla moved this from Resolved flakes (observe closed for a week before "Resolved") to Open PR-wait for >5 successes before "Resolved" in 1.15 CI Signal Mar 14, 2019

@mariantalla mariantalla moved this from Open PR-wait for >5 successes before "Resolved" to Resolved flakes (observe closed for a week before "Resolved") in 1.15 CI Signal Mar 15, 2019

grayluck (Contributor) commented Mar 17, 2019

Thanks for the nice graph. Looks like the fix works. I'll create cherry-pick PRs for 1.11 and 1.12 right away.

MrHohn (Member) commented Apr 4, 2019

AFAICT this has been significantly improved: https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=session%20affinity
[Screenshot from 2019-04-04 11-33-13]

Closing this and thanks to @grayluck!
/close

k8s-ci-robot (Contributor) commented Apr 4, 2019

@MrHohn: Closing this issue.


1.15 CI Signal automation moved this from Resolved flakes (observe closed for a week before "Resolved") to Failed-test w/open PR-wait for >5 successes before "Resolved" Apr 4, 2019

@alejandrox1 alejandrox1 moved this from Failed-test w/open PR-wait for >5 successes before "Resolved" to Resolved (>2 weeks old) in 1.15 CI Signal Apr 19, 2019
