
[Failing test] gce-cos-<branch>-ingress are timing out #75186

Closed
mariantalla opened this Issue Mar 8, 2019 · 31 comments

Comments

@mariantalla
Contributor

commented Mar 8, 2019

Which jobs are failing:

Which test(s) are failing:
The jobs are timing out, but just before that, the following tests failed:

  • [sig-network] Loadbalancing: L7 GCE [Slow] [Feature:Ingress] should be able to switch between HTTPS and HTTP2 modes
  • [sig-network] Loadbalancing: L7 GCE [Slow] [Feature:Ingress] should update ingress while sync failures occur on other ingresses
  • [sig-network] Loadbalancing: L7 GCE [Slow] [Feature:NEG] should create NEGs for all ports with the Ingress annotation, and NEGs for the standalone annotation otherwise (in gce-cos-master-ingress)
  • [sig-network] Loadbalancing: L7 GCE [Slow] [Feature:Ingress] should be able to switch between HTTPS and HTTP2 modes (in gce-cos-1.14-ingress)

Since when has it been failing:
late 2019-03-07

Testgrid link:

Reason for failure:
Timeout related to the L7 controller:

error: L7 controller failed to delete all cloud resources on time. timed out waiting for the condition

/sig network
/priority critical-urgent (failing consistently in release blocking jobs, blocking other tests entirely from running)

cc @kubernetes/sig-network-test-failures
cc @smourapina @alejandrox1 @kacole2 @mortent

@mariantalla

Contributor Author

commented Mar 8, 2019

/milestone v1.14

@k8s-ci-robot k8s-ci-robot added this to the v1.14 milestone Mar 8, 2019

@mariantalla mariantalla added this to Flakes in 1.15 CI Signal Mar 8, 2019

@mariantalla mariantalla moved this from Flakes to New (no response yet) in 1.15 CI Signal Mar 8, 2019

@nikopen

Member

commented Mar 8, 2019

@mariantalla

Contributor Author

commented Mar 11, 2019

@kubernetes/sig-gcp-test-failures could this be a gcp issue?

@mariantalla

Contributor Author

commented Mar 11, 2019

also cc'ing the sig-network chairs; could you help to triage this please? 🙏🏻

@thockin @dcbw @caseydavenport

@MrHohn

Member

commented Mar 11, 2019

It seems the Ingress controller kept panicking during the test (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-cos-k8sbeta-ingress/6367/artifacts/test-6724261826-master/glbc.log):

I0311 14:12:41.510138       1 backends.go:249] Sync: backends [{ingress-5906/echoheaders/443 32165 443 HTTPS 8443 false <nil>}]
E0311 14:12:42.100687       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/k8s.io/ingress-gce/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/k8s.io/ingress-gce/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/k8s.io/ingress-gce/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:573
/usr/local/go/src/runtime/panic.go:502
/usr/local/go/src/runtime/panic.go:63
/usr/local/go/src/runtime/signal_unix.go:388
/go/src/k8s.io/ingress-gce/pkg/healthchecks/healthchecks.go:332
/go/src/k8s.io/ingress-gce/pkg/healthchecks/healthchecks.go:244
/go/src/k8s.io/ingress-gce/pkg/healthchecks/healthchecks.go:106
/go/src/k8s.io/ingress-gce/pkg/backends/backends.go:220
/go/src/k8s.io/ingress-gce/pkg/backends/backends.go:283
/go/src/k8s.io/ingress-gce/pkg/backends/backends.go:252
/go/src/k8s.io/ingress-gce/pkg/controller/cluster_manager.go:98
/go/src/k8s.io/ingress-gce/pkg/controller/controller.go:349
/go/src/k8s.io/ingress-gce/pkg/controller/controller.go:299
/go/src/k8s.io/ingress-gce/pkg/controller/controller.go:111
/go/src/k8s.io/ingress-gce/pkg/utils/taskqueue.go:84
/go/src/k8s.io/ingress-gce/pkg/utils/taskqueue.go:54
/go/src/k8s.io/ingress-gce/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/k8s.io/ingress-gce/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/k8s.io/ingress-gce/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/go/src/k8s.io/ingress-gce/pkg/utils/taskqueue.go:54
/usr/local/go/src/runtime/asm_amd64.s:2361
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1674e63]

/cc @rramkumar1
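
For readers unfamiliar with this failure mode, here is a minimal Go sketch of how an unguarded pointer dereference inside a sync step produces the panic shown in the trace, and how a nil guard avoids it. Everything in it is hypothetical: `ServicePort`, `HealthCheckConfig`, `requestPath`, and `syncSafely` are illustrative stand-ins, not the real ingress-gce types or functions, and the sketch makes no claim about the actual root cause, which is tracked in the ingress-gce issues linked in this thread.

```go
package main

import "fmt"

// HealthCheckConfig and ServicePort are hypothetical stand-ins for
// illustration only; they are NOT the real ingress-gce types.
type HealthCheckConfig struct {
	RequestPath string
}

type ServicePort struct {
	Name string
	// Config is optional and may be nil.
	Config *HealthCheckConfig
}

// requestPathUnguarded dereferences Config without checking it, so it
// panics with a nil pointer dereference when Config is nil.
func requestPathUnguarded(sp ServicePort) string {
	return sp.Config.RequestPath
}

// requestPath falls back to a default path instead of dereferencing nil.
func requestPath(sp ServicePort) string {
	if sp.Config == nil {
		return "/"
	}
	return sp.Config.RequestPath
}

// syncSafely runs one sync step and converts a panic into an error, so a
// single bad item is reported rather than crashing the caller.
func syncSafely(sync func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("sync panicked: %v", r)
		}
	}()
	sync()
	return nil
}

func main() {
	sp := ServicePort{Name: "ingress-5906/echoheaders/443"}

	// Guarded access returns the default path.
	fmt.Println(requestPath(sp))

	// Unguarded access panics; syncSafely reports it as an error.
	err := syncSafely(func() { _ = requestPathUnguarded(sp) })
	fmt.Println(err)
}
```

Recovering from the panic only keeps the process alive; the nil guard is what lets the sync step finish, which matters here because a controller that keeps panicking also cannot clean up its cloud resources in time.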

@MrHohn

Member

commented Mar 12, 2019

Seems like @freehan has filed kubernetes/ingress-gce#675.

@nikopen

Member

commented Mar 12, 2019

this kubernetes/ingress-gce#678 will likely mitigate the CI failures

@cmluciano

Member

commented Mar 12, 2019

Assigning to freehan and the reviewer of kubernetes/ingress-gce#678

/assign @MrHohn @freehan

@cmluciano

Member

commented Mar 12, 2019

@nikopen Can we move this to "Under investigation" on the CI project board?

@nikopen

Member

commented Mar 12, 2019

@mariantalla ^^

@mariantalla mariantalla moved this from New (no response yet) to Under investigation (prioritized) in 1.15 CI Signal Mar 12, 2019

@cmluciano

Member

commented Mar 12, 2019

/triage unresolved

@mariantalla

Contributor Author

commented Mar 12, 2019

Hey @cmluciano I noticed the triage/unresolved label - does that mean that it won't be resolved (as is documented here) or that it hasn't been yet?

@cmluciano

Member

commented Mar 12, 2019

@mariantalla No, this was a misunderstanding on my part about what this label meant. I thought it signified "open but unassigned".

/remove-label triage/unresolved

@nikopen

Member

commented Mar 13, 2019

/remove-triage unresolved

The triage labels seem to be a bit old; they should be revamped.

@mariantalla mariantalla moved this from Under investigation (prioritized) to Open PR-wait for >5 successes before "Resolved" in 1.15 CI Signal Mar 13, 2019

@mariantalla

Contributor Author

commented Mar 16, 2019

"should be able to switch between HTTPS and HTTP2 modes" and diffResources are still failing consistently, and nearly all other "Loadbalancing: L7 GCE" tests are flaking frequently at the moment.

/reopen

@k8s-ci-robot

Contributor

commented Mar 16, 2019

@mariantalla: Reopened this issue.


@k8s-ci-robot k8s-ci-robot reopened this Mar 16, 2019

1.15 CI Signal automation moved this from Open PR-wait for >5 successes before "Resolved" to Under investigation (prioritized) Mar 16, 2019

@mariantalla

Contributor Author

commented Mar 18, 2019

master-blocking runs are clearing out, but 1.14-blocking still seems to include the http2 test (run #6430 is the first that includes #75411 in v1.14)

@MrHohn

Member

commented Mar 18, 2019

Took a quick look; it turns out the job "ci-kubernetes-e2e-gce-cos-k8sbeta-ingress" doesn't skip tests with the [Unreleased] tag.

The job config is auto-generated: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/generated/generated.yaml

@krzyzacy or @yguo0905, how can we add the flag --ginkgo.skip=\[Unreleased\] to that job?
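
A note on why the backslashes in that flag matter: the skip value is treated as a regular expression, so an unescaped `[Unreleased]` is a character class that matches any test name containing one of those letters. The Go sketch below illustrates the difference with the standard `regexp` package. The `matches` helper is hypothetical, and the position of the `[Unreleased]` tag in the first test name is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether a skip pattern matches a test name.
// Hypothetical helper for illustration, not part of ginkgo.
func matches(pattern, name string) bool {
	return regexp.MustCompile(pattern).MatchString(name)
}

// The tag placement in taggedTest is assumed for this sketch.
const (
	taggedTest   = "[sig-network] Loadbalancing: L7 GCE [Slow] [Feature:Ingress] [Unreleased] should be able to switch between HTTPS and HTTP2 modes"
	untaggedTest = "[sig-network] Loadbalancing: L7 GCE [Slow] [Feature:Ingress] should update ingress while sync failures occur on other ingresses"
)

func main() {
	for _, name := range []string{taggedTest, untaggedTest} {
		// Escaped brackets match the literal tag only.
		escaped := matches(`\[Unreleased\]`, name)
		// Unescaped brackets form a character class (U, n, r, e, l, a, s, d).
		unescaped := matches(`[Unreleased]`, name)
		fmt.Printf("escaped=%v unescaped=%v %q\n", escaped, unescaped, name)
	}
}
```

With the escaped pattern only the tagged test is skipped; with the unescaped one both names match, so the job would skip nearly everything.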

@krzyzacy

Member

commented Mar 18, 2019

hummmmmmm when did that tag become a thing?!

@rramkumar1

Member

commented Mar 18, 2019

@MrHohn I think it is here

https://github.com/kubernetes/test-infra/blob/master/experiment/test_config.yaml

@krzyzacy We introduced that filter for Ingress-GCE tests only because we needed a way to turn off tests on specific branches.

@MrHohn

Member

commented Mar 18, 2019

Thanks both, let me send a PR to update that.

@krzyzacy

Member

commented Mar 18, 2019

Why not use the feature tag? Introducing a new tag that needs to be skipped by default will presumably break everything :-)

@krzyzacy

Member

commented Mar 18, 2019

[feature:unreleased], instead of [unreleased]
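
To illustrate the reasoning behind this suggestion: if a job already skips every `[Feature:...]` test by pattern, a tag of the form `[Feature:Unreleased]` is skipped without touching the job, while a bare `[Unreleased]` tag is not. The sketch below assumes a default skip pattern of the kind commonly seen in e2e job configs; the exact pattern, the `skippedByDefault` helper, and both test names are assumptions for illustration, not taken from any specific job.

```go
package main

import (
	"fmt"
	"regexp"
)

// defaultSkip is an ASSUMED default skip pattern, not copied from a real job.
var defaultSkip = regexp.MustCompile(`\[Slow\]|\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]`)

// skippedByDefault reports whether the assumed default pattern skips a test.
func skippedByDefault(name string) bool {
	return defaultSkip.MatchString(name)
}

// Both test names are hypothetical.
const (
	bareTag    = "[sig-network] example suite [Unreleased] does something"
	featureTag = "[sig-network] example suite [Feature:Unreleased] does something"
)

func main() {
	fmt.Printf("bare tag skipped: %v\n", skippedByDefault(bareTag))
	fmt.Printf("feature tag skipped: %v\n", skippedByDefault(featureTag))
}
```

Jobs that explicitly focus on a feature tag, as the ingress jobs do, would still need their own skip entry either way.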

@MrHohn

Member

commented Mar 18, 2019

@krzyzacy That sounds like a reasonable approach, but it means we will need to update most of the ingress tests & jobs (more on-going PRs).

Could we update that job to skip Unreleased for now and file an issue to clean up the tag after code freeze? (Or maybe kubernetes/ingress-gce#667 will happen first :) )

@krzyzacy

Member

commented Mar 18, 2019

I have a feeling tests with this Unreleased tag are running in other jobs... which branch has this tag?

@MrHohn

Member

commented Mar 18, 2019

@krzyzacy The [Unreleased] tag always appears together with the [Feature:Ingress] tag, so as long as we patch those ingress jobs it should be fine. kubernetes/test-infra#11820 is updating the remaining jobs:

  • k8s-stable3
  • k8s-stable2
  • k8s-stable1
  • k8s-beta

@rramkumar1

Member

commented Mar 18, 2019

+1 to MrHohn

We have the filter on every job config. Each individual test carries the [Unreleased] tag depending on which branch it needs to be tested on, so if none of the actual tests have "Unreleased", nothing gets skipped.

Ref: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/network/ingress.go#L431
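
A small Go sketch of the selection logic described above, assuming a job that focuses on `\[Feature:Ingress\]` and skips `\[Unreleased\]`, with skip taking precedence over focus. The `selectTests` helper, the two branch lists, and the placement of the tag are hypothetical and only serve to show that the same filter skips nothing on a branch where no test carries the tag.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	focus = regexp.MustCompile(`\[Feature:Ingress\]`)
	skip  = regexp.MustCompile(`\[Unreleased\]`)
)

// Hypothetical per-branch test lists; the tag placement is assumed.
var (
	branchWithoutTag = []string{
		"[sig-network] Loadbalancing: L7 GCE [Slow] [Feature:Ingress] should be able to switch between HTTPS and HTTP2 modes",
		"[sig-network] Loadbalancing: L7 GCE [Slow] [Feature:Ingress] should update ingress while sync failures occur on other ingresses",
	}
	branchWithTag = []string{
		"[sig-network] Loadbalancing: L7 GCE [Slow] [Feature:Ingress] [Unreleased] should be able to switch between HTTPS and HTTP2 modes",
		"[sig-network] Loadbalancing: L7 GCE [Slow] [Feature:Ingress] should update ingress while sync failures occur on other ingresses",
	}
)

// selectTests keeps tests that match focus and do not match skip.
func selectTests(names []string) []string {
	var run []string
	for _, name := range names {
		if focus.MatchString(name) && !skip.MatchString(name) {
			run = append(run, name)
		}
	}
	return run
}

func main() {
	fmt.Println("branch without tag runs:", len(selectTests(branchWithoutTag)))
	fmt.Println("branch with tag runs:", len(selectTests(branchWithTag)))
}
```

The filter is therefore safe to add to every ingress job; it only has an effect on branches where a test has been tagged.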

@krzyzacy

Member

commented Mar 18, 2019

👍

@mariantalla

Contributor Author

commented Mar 18, 2019

Took a quick look; it turns out the job "ci-kubernetes-e2e-gce-cos-k8sbeta-ingress" doesn't skip tests with the [Unreleased] tag.

👍 that explains a lot of things 😅
1.14-blocking is now starting to pass, with the https/http2 test removed.

@mariantalla mariantalla moved this from Under investigation (prioritized) to Open PR-wait for >5 successes before "Resolved" in 1.15 CI Signal Mar 18, 2019

@mariantalla

Contributor Author

commented Mar 20, 2019

All clear, closing.

/close

@k8s-ci-robot

Contributor

commented Mar 20, 2019

@mariantalla: Closing this issue.


@mariantalla mariantalla moved this from Failed-test w/open PR-wait for >5 successes before "Resolved" to Resolved (week Mar 18) in 1.15 CI Signal Mar 20, 2019

@alejandrox1 alejandrox1 moved this from Resolved (week Mar 18) to Resolved (>2 weeks old) in 1.15 CI Signal Apr 19, 2019
