
[Failing test] [sig-network] Firewall rule [Slow] [Serial] should create valid firewall rules for LoadBalancer type service #74887

Closed
mariantalla opened this Issue Mar 4, 2019 · 16 comments


@mariantalla (Contributor) commented Mar 4, 2019

Which jobs are failing:
ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new

Which test(s) are failing:
[sig-network] Firewall rule [Slow] [Serial] should create valid firewall rules for LoadBalancer type service

Since when has it been failing:
2019-03-01, test-infra:e858a8b2e

(gce-new-master-upgrade-cluster-new shows fa9347840 as the first commit for which the test failed, but fa9347840 came after e858a8b2e).

Testgrid link:

Reason for failure:

error waiting for firewall k8s-0db080b12d134e12-node-http-hc exist=false

/sig testing
/sig network
/kind failing-test
/priority critical-urgent
/milestone v1.14

cc @smourapina @alejandrox1 @kacole2 @mortent

@MrHohn (Member) commented Mar 5, 2019

It appears that the test is failing on waiting for the node health check firewall rule (which is shared among external Load Balancers) to be deleted.

I0305 12:51:54.772] [It] [Slow] [Serial] should create valid firewall rules for LoadBalancer type service
I0305 12:51:54.772] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/firewall.go:55
I0305 12:51:54.772] STEP: Getting cluster ID
I0305 12:51:54.814] Mar 5 12:51:54.814: INFO: Got cluster ID: 8c3d017620d40edc
I0305 12:51:54.899] STEP: Creating a LoadBalancer type service with ExternalTrafficPolicy=Global
I0305 12:51:54.899] STEP: creating a service firewall-test-2403/firewall-test-loadbalancer with type=LoadBalancer
I0305 12:51:54.960] STEP: waiting for loadbalancer for service firewall-test-2403/firewall-test-loadbalancer
I0305 12:51:54.961] Mar 5 12:51:54.960: INFO: Waiting up to 20m0s for service "firewall-test-loadbalancer" to have a LoadBalancer
I0305 12:52:33.047] STEP: Checking if service's firewall rule is correct
I0305 12:52:33.269] STEP: Checking if service's nodes health check firewall rule is correct
I0305 12:52:33.409] STEP: Updating LoadBalancer service to ExternalTrafficPolicy=Local
I0305 12:52:33.519] STEP: Waiting for the nodes health check firewall rule to be deleted
I0305 12:52:33.520] Mar 5 12:52:33.519: INFO: Waiting up to 15m0s for firewall k8s-8c3d017620d40edc-node-http-hc exist=false
<----------------------- failed here ----------------------->
I0305 13:07:34.184] STEP: Waiting for the local traffic health check firewall rule to be deleted
I0305 13:07:34.184] Mar 5 13:07:34.183: INFO: Waiting up to 15m0s for firewall k8s-a7460660c3f4511e9996a42010a8a000-http-hc exist=false
I0305 13:08:14.553] [AfterEach] [sig-network] Firewall rule

Also note that this test doesn't fail in non-upgrade jobs, e.g. https://k8s-testgrid.appspot.com/sig-network-gce#gci-gce-serial&width=20
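
For context, the step that times out is an existence poll against the GCE firewall API until the rule disappears (or 15 minutes pass). Below is a minimal sketch of that wait in Go, assuming a hypothetical firewallExists helper backed by the compute API; the real logic lives in the e2e framework helpers used by test/e2e/network/firewall.go:

```go
package main

import (
	"fmt"
	"time"
)

// firewallExists is a stand-in for a GCE compute API lookup
// (e.g. a firewalls.get call with a 404 mapped to "absent").
func firewallExists(name string) (bool, error) {
	return false, nil
}

// waitForFirewallAbsent polls until the named firewall rule no longer exists
// or the timeout elapses -- the step that timed out after 15m in this failure.
func waitForFirewallAbsent(name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		exists, err := firewallExists(name)
		if err == nil && !exists {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("error waiting for firewall %s exist=false", name)
}

func main() {
	if err := waitForFirewallAbsent("k8s-8c3d017620d40edc-node-http-hc", 15*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```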

@MrHohn (Member) commented Mar 5, 2019

My current suspicion is that the LB service created by the upgrade test wasn't properly deleted before this "serial" test started. If any other LB service exists (besides the ones created by this test), the node health check firewall rule will not be removed.
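
If that is what's happening, one quick check is to list all Services of type LoadBalancer across namespaces right before the serial test starts. A rough client-go sketch (written against the pre-context client-go API of that era; the kubeconfig path and output format are illustrative):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List every Service in the cluster and report the LoadBalancer ones;
	// any hit outside the firewall test's own namespace would keep the
	// shared node health check firewall rule alive.
	svcs, err := client.CoreV1().Services(metav1.NamespaceAll).List(metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, svc := range svcs.Items {
		if svc.Spec.Type == corev1.ServiceTypeLoadBalancer {
			fmt.Printf("leftover LoadBalancer service: %s/%s\n", svc.Namespace, svc.Name)
		}
	}
}
```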

@mariantalla mariantalla moved this from New to Under investigation in 1.15 CI Signal Mar 5, 2019

@MrHohn (Member) commented Mar 5, 2019

cc @grayluck to see if he can help :)

@MrHohn (Member) commented Mar 6, 2019

Dug into this a bit with @grayluck. We found that the upgrade test itself panicked midway and exited, and hence didn't clean up all the resources (including LoadBalancer type services) it created. Unexpectedly, the suite continued to run the rest of the tests (including this serial firewall test, which assumes no other LB service exists) and they failed.

Some relevant logs from https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/2135:

I0305 07:19:31.276] Mar 5 07:19:31.276: INFO: Trying to get logs from node bootstrap-e2e-minion-group-x74g pod pod-configmap-f8818e51-3f16-11e9-a307-ba0b324b55ca container configmap-env-test: <nil>
I0305 07:19:31.308] STEP: delete the pod
I0305 07:19:31.326] STEP: delete the pod
I0305 07:19:31.365] Mar 5 07:19:31.365: INFO: Waiting for pod pod-secrets-f8818435-3f16-11e9-a307-ba0b324b55ca to disappear
I0305 07:19:31.391] Mar 5 07:19:31.391: INFO: Waiting for pod pod-configmap-f8818e51-3f16-11e9-a307-ba0b324b55ca to disappear
I0305 07:19:31.406] Mar 5 07:19:31.406: INFO: Pod pod-secrets-f8818435-3f16-11e9-a307-ba0b324b55ca no longer exists
I0305 07:19:31.406] fatal error: sync: inconsistent mutex state
I0305 07:19:31.408]
I0305 07:19:31.408] goroutine 296 [running]:
I0305 07:19:31.408] runtime.throw(0x4b0a293, 0x1e)
I0305 07:19:31.408] /usr/local/go/src/runtime/panic.go:617 +0x72 fp=0xc001f17e30 sp=0xc001f17e00 pc=0x430712
I0305 07:19:31.409] sync.throw(0x4b0a293, 0x1e)
I0305 07:19:31.409] /usr/local/go/src/runtime/panic.go:603 +0x35 fp=0xc001f17e50 sp=0xc001f17e30 pc=0x430695
I0305 07:19:31.409] sync.(*Mutex).Lock(0xc001fd2030)
I0305 07:19:31.409] /usr/local/go/src/sync/mutex.go:121 +0x1ec fp=0xc001f17e90 sp=0xc001f17e50 pc=0x4689ec
I0305 07:19:31.409] sync.(*Once).Do(0xc001fd2030, 0xc001f17ed0)
I0305 07:19:31.409] /usr/local/go/src/sync/once.go:40 +0x3b fp=0xc001f17ec0 sp=0xc001f17e90 pc=0x468bdb
I0305 07:19:31.409] k8s.io/kubernetes/test/e2e/lifecycle.(*chaosMonkeyAdapter).Test.func1()
I0305 07:19:31.409] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:439 +0x51 fp=0xc001f17ef0 sp=0xc001f17ec0 pc=0x3906131
I0305 07:19:31.410] k8s.io/kubernetes/test/e2e/lifecycle.(*chaosMonkeyAdapter).Test(0xc002302340, 0xc00201d740)
I0305 07:19:31.410] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:455 +0x1ed fp=0xc001f17f88 sp=0xc001f17ef0 pc=0x38fefbd
I0305 07:19:31.410] k8s.io/kubernetes/test/e2e/lifecycle.(*chaosMonkeyAdapter).Test-fm(0xc00201d740)
I0305 07:19:31.410] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:435 +0x34 fp=0xc001f17fa8 sp=0xc001f17f88 pc=0x3917354
I0305 07:19:31.410] k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do.func1(0xc00201d740, 0xc001befb40)
I0305 07:19:31.410] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:89 +0x76 fp=0xc001f17fd0 sp=0xc001f17fa8 pc=0x38c2536
I0305 07:19:31.410] runtime.goexit()
I0305 07:19:31.411] /usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc001f17fd8 sp=0xc001f17fd0 pc=0x45f981
I0305 07:19:31.411] created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do
I0305 07:19:31.411] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:86 +0xa7

We will do some more investigation, but it would be great to have help from the relevant folks to look into why the upgrade test itself hits a fatal error. @mariantalla
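
For reference, `fatal error: sync: inconsistent mutex state` is raised from the sync package (via runtime.throw, as in the stack above) when a Mutex's internal state looks impossible. Since go1.12.1 later appeared to make this crash go away (see below), the toolchain is the prime suspect rather than the test code, but for completeness: the classic user-level way to corrupt sync state is copying a value that contains a sync.Mutex or sync.Once, which go vet's copylocks check exists to catch. The sketch below only illustrates that anti-pattern; it is not a reproduction of this failure:

```go
package main

import "sync"

// counter embeds a Mutex; copying the struct by value also copies the lock.
type counter struct {
	mu sync.Mutex
	n  int
}

// inc takes its argument by value, so every call locks a throwaway copy of
// the mutex instead of the shared one. `go vet` reports "inc passes lock by value".
func inc(c counter) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++ // only updates the copy; the shared counter is never protected
}

func main() {
	var c counter
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			inc(c) // copies c (and its mutex) on each call; vet flags this too
		}()
	}
	wg.Wait()
}
```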

@MrHohn (Member) commented Mar 6, 2019

Similar findings from @msau42 on #74890 (comment) as well.

@MrHohn (Member) commented Mar 6, 2019

Seems like we should use #74893 to track the upgrade test failure.

@mariantalla (Contributor, Author) commented Mar 7, 2019

@MrHohn while the underlying issue gets fixed, is there another job we can look at that covers the same behavior and upgrade path? e.g. something from sig-network's dashboards perhaps?

@MrHohn (Member) commented Mar 8, 2019

@mariantalla From the sig-network dashboards we run the same test, but that job doesn't exercise the upgrade path:
https://k8s-testgrid.appspot.com/sig-network-gce#gci-gce-serial.

This test is passing in another upgrade job though: https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster-parallel&include-filter-by-regex=firewall

@soggiest (Contributor) commented Mar 8, 2019

Hello! We are in code freeze for 1.14. It looks like investigation is still underway for this issue; will it be resolved in the next week? If this is a non-release-blocking issue, can we move it to 1.15?

@alejandrox1 (Contributor) commented Mar 8, 2019

@soggiest we are tracking this issue under milestone v1.14 because these are failures in master-upgrade.

@alejandrox1 (Contributor) commented Mar 8, 2019

@MrHohn I see the tests clearing up in both of the aforementioned jobs, but no PRs referencing this issue. Did something else happen? 🤔

@alejandrox1 alejandrox1 moved this from Under investigation (prioritized) to Open PR-wait for >5 successes before "Resolved" in 1.15 CI Signal Mar 8, 2019

@MrHohn (Member) commented Mar 9, 2019

> @MrHohn I see the tests clearing up in both of the aforementioned jobs, but no PRs referencing this issue. Did something else happen? 🤔

I'm guessing it will flake again; we're still seeing the same error on some of the latest runs.

@alejandrox1 alejandrox1 moved this from Open PR-wait for >5 successes before "Resolved" to Under investigation (prioritized) in 1.15 CI Signal Mar 9, 2019

@MrHohn (Member) commented Mar 14, 2019

The workaround for #74890 seems to have worked and this test started passing. Will wait for it to become stable.

@spiffxp (Member) commented Mar 16, 2019

It appears go1.12.1 may have fixed this as well; moving to observation.

@spiffxp spiffxp moved this from Under investigation (prioritized) to Open PR-wait for >5 successes before "Resolved" in 1.15 CI Signal Mar 16, 2019

@k8s-ci-robot (Contributor) commented Mar 18, 2019

@spiffxp: Closing this issue.

In response to this:

/close
Calling this resolved
https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=gce&test=should%20create%20valid%20firewall%20rules%20for%20LoadBalancer%20type%20service
[Screenshot: triage results for the firewall test, taken Mar 18, 2019]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp spiffxp moved this from Open PR-wait for >5 successes before "Resolved" to Resolved (week Mar 11) in 1.15 CI Signal Mar 18, 2019

@spiffxp spiffxp moved this from Resolved (week Mar 11) to Resolved (week Mar 18) in 1.15 CI Signal Mar 18, 2019

@alejandrox1 alejandrox1 moved this from Resolved (week Mar 18) to Resolved (>2 weeks old) in 1.15 CI Signal Apr 19, 2019
