
[Failing test] [sig-network] Firewall rule [Slow] [Serial] should create valid firewall rules for LoadBalancer type service #74887

Closed
mariantalla opened this issue Mar 4, 2019 · 16 comments


mariantalla commented Mar 4, 2019

Which jobs are failing:
ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new

Which test(s) are failing:
[sig-network] Firewall rule [Slow] [Serial] should create valid firewall rules for LoadBalancer type service

Since when has it been failing:
2019-03-01, test-infra:e858a8b2e

(gce-new-master-upgrade-cluster-new shows fa9347840 as the first commit for which the test failed, but fa9347840 came after e858a8b2e).

Testgrid link:

Reason for failure:

error waiting for firewall k8s-0db080b12d134e12-node-http-hc exist=false

/sig testing
/sig network
/kind failing-test
/priority critical-urgent
/milestone v1.14

cc @smourapina @alejandrox1 @kacole2 @mortent

@mariantalla mariantalla added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Mar 4, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.14 milestone Mar 4, 2019
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/network Categorizes an issue or PR as relevant to SIG Network. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 4, 2019

MrHohn commented Mar 5, 2019

It appears that the test is failing while waiting for the node health check firewall rule (which is shared among external load balancers) to be deleted.

I0305 12:51:54.772] [It] [Slow] [Serial] should create valid firewall rules for LoadBalancer type service
I0305 12:51:54.772] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/firewall.go:55
I0305 12:51:54.772] STEP: Getting cluster ID
I0305 12:51:54.814] Mar 5 12:51:54.814: INFO: Got cluster ID: 8c3d017620d40edc
I0305 12:51:54.899] STEP: Creating a LoadBalancer type service with ExternalTrafficPolicy=Global
I0305 12:51:54.899] STEP: creating a service firewall-test-2403/firewall-test-loadbalancer with type=LoadBalancer
I0305 12:51:54.960] STEP: waiting for loadbalancer for service firewall-test-2403/firewall-test-loadbalancer
I0305 12:51:54.961] Mar 5 12:51:54.960: INFO: Waiting up to 20m0s for service "firewall-test-loadbalancer" to have a LoadBalancer
I0305 12:52:33.047] STEP: Checking if service's firewall rule is correct
I0305 12:52:33.269] STEP: Checking if service's nodes health check firewall rule is correct
I0305 12:52:33.409] STEP: Updating LoadBalancer service to ExternalTrafficPolicy=Local
I0305 12:52:33.519] STEP: Waiting for the nodes health check firewall rule to be deleted
I0305 12:52:33.520] Mar 5 12:52:33.519: INFO: Waiting up to 15m0s for firewall k8s-8c3d017620d40edc-node-http-hc exist=false
<----------------------- failed here ----------------------->
I0305 13:07:34.184] STEP: Waiting for the local traffic health check firewall rule to be deleted
I0305 13:07:34.184] Mar 5 13:07:34.183: INFO: Waiting up to 15m0s for firewall k8s-a7460660c3f4511e9996a42010a8a000-http-hc exist=false
I0305 13:08:14.553] [AfterEach] [sig-network] Firewall rule

Also note that this test doesn't fail in non-upgrade jobs, e.g. https://k8s-testgrid.appspot.com/sig-network-gce#gci-gce-serial&width=20
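For anyone unfamiliar with that step: the wait is essentially a poll against the GCE compute API until the named firewall rule stops existing. A minimal sketch of the pattern in Go (not the e2e framework's actual helper; the interval and the 404 handling here are simplified placeholders):

```go
// Package sketch illustrates the "Waiting up to 15m0s for firewall ...
// exist=false" step: poll the GCE compute API until the named firewall
// rule is gone, or the timeout expires.
package sketch

import (
	"time"

	compute "google.golang.org/api/compute/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForFirewallGone polls every 10s until the firewall rule named `name`
// no longer exists in `project`, or until `timeout` elapses.
func waitForFirewallGone(svc *compute.Service, project, name string, timeout time.Duration) error {
	return wait.PollImmediate(10*time.Second, timeout, func() (bool, error) {
		_, err := svc.Firewalls.Get(project, name).Do()
		if err != nil {
			// Treat a lookup failure as "rule is gone"; a stricter check
			// would verify the error is specifically an HTTP 404.
			return true, nil
		}
		// Rule still exists, keep polling.
		return false, nil
	})
}
```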


MrHohn commented Mar 5, 2019

My current suspicion is that the LB service created by the upgrade test wasn't properly deleted before this "serial" test started. If any other LB service exists (other than the ones created by this test), the node health check firewall rule will not be removed.
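One quick way to check that suspicion would be to list every LoadBalancer-type Service left in the cluster before the serial test runs. A rough client-go sketch (kubeconfig handling simplified; note that List takes no context argument in client-go releases older than 0.18):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the default kubeconfig; inside the e2e
	// framework the test would use its existing clientset instead.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List services across all namespaces and flag any of type LoadBalancer,
	// since any such leftover keeps the shared node health check firewall
	// rule alive.
	svcs, err := clientset.CoreV1().Services(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, svc := range svcs.Items {
		if svc.Spec.Type == corev1.ServiceTypeLoadBalancer {
			fmt.Printf("leftover LoadBalancer service: %s/%s\n", svc.Namespace, svc.Name)
		}
	}
}
```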

@mariantalla mariantalla moved this from New to Under investigation in CI Signal team (SIG Release) Mar 5, 2019

MrHohn commented Mar 5, 2019

cc @grayluck to see if he can help :)


MrHohn commented Mar 6, 2019

Dug into this a bit with @grayluck. We found that the upgrade test itself panicked partway through and exited, so it didn't clean up all the resources (including LoadBalancer-type services) it created. Unexpectedly, the job continued to run the rest of the tests (including this serial firewall test, which assumes no other LB service exists), which then failed.

Some relevant logs from https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new/2135:

I0305 07:19:31.276] Mar 5 07:19:31.276: INFO: Trying to get logs from node bootstrap-e2e-minion-group-x74g pod pod-configmap-f8818e51-3f16-11e9-a307-ba0b324b55ca container configmap-env-test: <nil>
I0305 07:19:31.308] STEP: delete the pod
I0305 07:19:31.326] STEP: delete the pod
I0305 07:19:31.365] Mar 5 07:19:31.365: INFO: Waiting for pod pod-secrets-f8818435-3f16-11e9-a307-ba0b324b55ca to disappear
I0305 07:19:31.391] Mar 5 07:19:31.391: INFO: Waiting for pod pod-configmap-f8818e51-3f16-11e9-a307-ba0b324b55ca to disappear
I0305 07:19:31.406] Mar 5 07:19:31.406: INFO: Pod pod-secrets-f8818435-3f16-11e9-a307-ba0b324b55ca no longer exists
I0305 07:19:31.406] fatal error: sync: inconsistent mutex state
I0305 07:19:31.408]
I0305 07:19:31.408] goroutine 296 [running]:
I0305 07:19:31.408] runtime.throw(0x4b0a293, 0x1e)
I0305 07:19:31.408] /usr/local/go/src/runtime/panic.go:617 +0x72 fp=0xc001f17e30 sp=0xc001f17e00 pc=0x430712
I0305 07:19:31.409] sync.throw(0x4b0a293, 0x1e)
I0305 07:19:31.409] /usr/local/go/src/runtime/panic.go:603 +0x35 fp=0xc001f17e50 sp=0xc001f17e30 pc=0x430695
I0305 07:19:31.409] sync.(*Mutex).Lock(0xc001fd2030)
I0305 07:19:31.409] /usr/local/go/src/sync/mutex.go:121 +0x1ec fp=0xc001f17e90 sp=0xc001f17e50 pc=0x4689ec
I0305 07:19:31.409] sync.(*Once).Do(0xc001fd2030, 0xc001f17ed0)
I0305 07:19:31.409] /usr/local/go/src/sync/once.go:40 +0x3b fp=0xc001f17ec0 sp=0xc001f17e90 pc=0x468bdb
I0305 07:19:31.409] k8s.io/kubernetes/test/e2e/lifecycle.(*chaosMonkeyAdapter).Test.func1()
I0305 07:19:31.409] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:439 +0x51 fp=0xc001f17ef0 sp=0xc001f17ec0 pc=0x3906131
I0305 07:19:31.410] k8s.io/kubernetes/test/e2e/lifecycle.(*chaosMonkeyAdapter).Test(0xc002302340, 0xc00201d740)
I0305 07:19:31.410] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:455 +0x1ed fp=0xc001f17f88 sp=0xc001f17ef0 pc=0x38fefbd
I0305 07:19:31.410] k8s.io/kubernetes/test/e2e/lifecycle.(*chaosMonkeyAdapter).Test-fm(0xc00201d740)
I0305 07:19:31.410] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:435 +0x34 fp=0xc001f17fa8 sp=0xc001f17f88 pc=0x3917354
I0305 07:19:31.410] k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do.func1(0xc00201d740, 0xc001befb40)
I0305 07:19:31.410] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:89 +0x76 fp=0xc001f17fd0 sp=0xc001f17fa8 pc=0x38c2536
I0305 07:19:31.410] runtime.goexit()
I0305 07:19:31.411] /usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc001f17fd8 sp=0xc001f17fd0 pc=0x45f981
I0305 07:19:31.411] created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do
I0305 07:19:31.411] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:86 +0xa7

We will do some more investigation, but it would be great to have help from the relevant folks to look at why the upgrade test itself hits this fatal error. @mariantalla
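For reference, the frame where the run dies is a sync.Once guarding the adapter's "ready" signal; the shape of that pattern is roughly the sketch below (names are illustrative, not the actual cluster_upgrade.go code). A fatal error thrown from the sync package cannot be caught by recover(), which is why the binary exits without running its usual cleanup.

```go
package main

import (
	"fmt"
	"sync"
)

// adapter mimics the shape seen in the stack trace: a sync.Once guards the
// "ready" signal so it fires exactly once, and a deferred call ensures it
// fires even if the test body panics. Illustrative only.
type adapter struct {
	once sync.Once
}

func (a *adapter) test(ready func()) {
	signalReady := func() { a.once.Do(ready) }
	defer signalReady()

	// ... upgrade test body; an ordinary panic here would still trigger the
	// deferred Once, but a runtime fatal error aborts the whole process ...
	fmt.Println("running test body")
}

func main() {
	a := &adapter{}
	a.test(func() { fmt.Println("signaled ready") })
}
```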


MrHohn commented Mar 6, 2019

Similar findings from @msau42 on #74890 (comment) as well.


MrHohn commented Mar 6, 2019

Seems like we should use #74893 to track the upgrade test failure.

@mariantalla

@MrHohn While the underlying issue gets fixed, is there another job we can look at that covers the same behavior and upgrade path? Something from sig-network's dashboards, perhaps?


MrHohn commented Mar 8, 2019

@mariantalla From the sig-network dashboards we run the same test, but that job doesn't exercise the upgrade path:
https://k8s-testgrid.appspot.com/sig-network-gce#gci-gce-serial.

This test is passing in another upgrade job though: https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster-parallel&include-filter-by-regex=firewall

@thockin thockin added the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 8, 2019

soggiest commented Mar 8, 2019

Hello! We are in code freeze for 1.14. It looks like investigation is still underway for this issue; will it be resolved in the next week? If this is a non-release-blocking issue, can we move it to 1.15?

@alejandrox1

@soggiest We are tracking this issue under milestone v1.14 because these are failures in master-upgrade.

@alejandrox1

@MrHohn I see the tests clearing up in both of the aforementioned jobs, but no PRs referencing this issue. Did something else happen? 🤔

@alejandrox1 alejandrox1 moved this from Under investigation (prioritized) to Open PR-wait for >5 successes before "Resolved" in CI Signal team (SIG Release) Mar 8, 2019

MrHohn commented Mar 9, 2019

> @MrHohn I see the tests clearing up in both of the aforementioned jobs, but no PRs referencing this issue. Did something else happen? 🤔

I'm guessing it will flake again; still seeing the same error on some of the latest runs.

@alejandrox1 alejandrox1 moved this from Open PR-wait for >5 successes before "Resolved" to Under investigation (prioritized) in CI Signal team (SIG Release) Mar 9, 2019

MrHohn commented Mar 14, 2019

The workaround for #74890 seems to have worked and this test started passing. Will wait for it to become stable.


spiffxp commented Mar 16, 2019

It appears go1.12.1 may have fixed this as well; moving to observation.

@spiffxp spiffxp moved this from Under investigation (prioritized) to Open PR-wait for >5 successes before "Resolved" in CI Signal team (SIG Release) Mar 16, 2019

spiffxp commented Mar 18, 2019

@k8s-ci-robot

@spiffxp: Closing this issue.

In response to this:

/close
Calling this resolved
https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=gce&test=should%20create%20valid%20firewall%20rules%20for%20LoadBalancer%20type%20service
[Screenshot: Gubernator triage results, 2019-03-18]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp spiffxp moved this from Open PR-wait for >5 successes before "Resolved" to Resolved (week Mar 11) in CI Signal team (SIG Release) Mar 18, 2019
@spiffxp spiffxp moved this from Resolved (week Mar 11) to Resolved (week Mar 18) in CI Signal team (SIG Release) Mar 18, 2019
@alejandrox1 alejandrox1 moved this from Resolved (week Mar 18) to Resolved (>2 weeks old) in CI Signal team (SIG Release) Apr 19, 2019