
All GKE upgrade tests failing since 10/29 #70445

Closed
jberkus opened this Issue Oct 30, 2018 · 12 comments

Comments

jberkus commented Oct 30, 2018

Starting last night (Pacific time), all GKE upgrade tests began failing. Many of them were very flaky before that, but even the non-flaky ones have been failing consistently since 13:00 on Oct. 29th. Since this does not line up with any particular commit (or infra commit), I strongly suspect that it's due to a change in GKE.

Which jobs are failing:

Which test(s) are failing:

Inconsistent; this does not appear to be an issue with any particular test. For gke-gci-master-gci-new-downgrade-cluster, for example, all of the tests are failing.

Since when has it been failing:

Oct. 29th, sometime after 13:00

/kind failing-test
/priority critical-urgent
/sig gcp

AishSundar commented Oct 30, 2018

Here are all the changes in this timeframe 1f3ef29...3293f02

aleksandra-malinowska commented Oct 30, 2018

It seems that the API server consistently fails its health check on the latest 1.11, i.e. cluster creation, upgrade, and downgrade all fail at this version. Investigating why.
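
For context, the check in question is presumably the apiserver's /healthz endpoint (the later error report says component "kube-apiserver" is unhealthy). A minimal sketch of probing that endpoint directly, with the master address as a placeholder and TLS verification skipped purely for illustration (depending on cluster configuration, credentials may also be required):

```go
// Minimal sketch (not the actual GKE health check): probe the apiserver's
// /healthz endpoint and print what it reports. MASTER_IP is a placeholder.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// TLS verification skipped only for this illustration.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	resp, err := client.Get("https://MASTER_IP/healthz") // placeholder address
	if err != nil {
		fmt.Println("apiserver unreachable:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("healthz: %d %s\n", resp.StatusCode, body)
}
```

A healthy apiserver returns 200 with the body "ok"; a non-200 response or a connection error matches the symptom seen in these runs.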

aleksandra-malinowska commented Oct 31, 2018

We have a root cause and a fix on the GKE side. It may take a day for the dashboards to clear up, though.

seans3 commented Nov 2, 2018

Any progress on this? I'm still seeing the following failure in the SIG CLI skew-cluster-stable1-kubectl-latest-gke test:

W1101 23:52:36.444]  zone: u'us-central1-c'>] finished with error: All cluster resources were brought up, but the cluster API is reporting that: component "kube-apiserver" from endpoint "gke-935eb573abb7cb6333da-408c" is unhealthy.

The first failure happened on 10/29 @13:24.

The SIG CLI testgrid is at: https://k8s-testgrid.appspot.com/sig-cli-master#skew-cluster-stable1-kubectl-latest-gke

aleksandra-malinowska commented Nov 2, 2018

> The SIG CLI testgrid is at: https://k8s-testgrid.appspot.com/sig-cli-master#skew-cluster-stable1-kubectl-latest-gke

A day was probably too optimistic :( The dashboards started to get greener around 6:30 PM PDT yesterday.

aleksandra-malinowska commented Nov 5, 2018

Testgrids were rearranged in the meantime; here are new links to some of the previously failing jobs:
https://k8s-testgrid.appspot.com/sig-release-master-upgrade-optional#gke-gci-master-gci-new-downgrade-cluster-parallel
https://k8s-testgrid.appspot.com/sig-release-1.11-all#gke-gci-1.10-gci-1.11-upgrade-master

Not exactly green, but running again. The original issue seems fixed.

/assign @jberkus

AishSundar commented Nov 5, 2018

I see almost all GKE upgrade jobs failing on the same test, "Master upgrade should maintain a functioning cluster".

OSS TestGrid:
https://testgrid.k8s.io/sig-release-master-upgrade-optional#gke-gci-new-gci-master-upgrade-master

Logs:
https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-gci-new-gci-master-upgrade-master/1646

Looking at the logs, the failure seems to point to an ingress issue.

/assign @mengqiy

mariantalla commented Nov 6, 2018

"Master upgrade should maintain a functioning cluster" has had a successful run, will keep an eye on it and will look to close this issue when it consistently passes...

@aleksandra-malinowska - what do you look for to ensure that the original issue is fixed in test runs? (asking as some tests are failing, but very possibly due to other reasons).

aleksandra-malinowska commented Nov 7, 2018

@mariantalla all tests were failing before because the cluster either failed to be created at 1.11 (in which case the scenarios wouldn't even run, i.e. it failed at setup) or failed to be upgraded/downgraded to this version. This is no longer the case for most jobs, and the errors in the few remaining ones are different. But we can of course keep this open for fixing/deflaking the remaining tests.

jberkus commented Nov 14, 2018

I'd rather close it and open a new issue for deflaking, since this is not a blocking issue anymore.

/close

k8s-ci-robot commented Nov 14, 2018

@jberkus: Closing this issue.

In response to this:

> I'd rather close it and open a new issue for deflaking, since this is not a blocking issue anymore.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
