
[upgrade test failure] Restart [Disruptive] should restart all nodes and ensure all nodes and pods recover #50797

Closed
ericchiang opened this issue Aug 16, 2017 · 13 comments
Assignees
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
  • sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
Milestone

Comments

@ericchiang
Contributor

Opening this since this seems slightly different from #46651

/cc @kubernetes/sig-node-bugs

This test has been consistently failing on a lot of the upgrade jobs:

https://k8s-testgrid.appspot.com/master-upgrade#gke-cvm-1.7-gci-master-upgrade-master
https://k8s-testgrid.appspot.com/master-upgrade#gke-cvm-1.7-gci-master-upgrade-cluster
https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-cvm-master-upgrade-master
https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-cvm-master-upgrade-cluster
https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-gci-master-upgrade-master
https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-gci-master-upgrade-cluster

What's really weird is the error message most of them spit out:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/restart.go:92
Expected error:
    <*errors.errorString | 0xc422322770>: {
        s: "couldn't find -1 nodes within 20s; last error: expected to find -1 nodes but found only 3 (20.007524987s elapsed)",
    }
    couldn't find -1 nodes within 20s; last error: expected to find -1 nodes but found only 3 (20.007524987s elapsed)
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/restart.go:77

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-cvm-new-cvm-master-upgrade-cluster/128#k8sio-restart-disruptive-should-restart-all-nodes-and-ensure-all-nodes-and-pods-recover

It's hard to tell who owns this test; I'm going to tag sig-node until there's further evidence otherwise.

cc @kubernetes/kubernetes-release-managers @mbohlool

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/bug Categorizes issue or PR as related to a bug. labels Aug 16, 2017
@ericchiang ericchiang added this to the v1.8 milestone Aug 16, 2017
@wojtek-t
Member

kubernetes/test-infra#4086 (comment) should fix this

@wojtek-t
Member

@krzyzacy - FYI

@krzyzacy
Member

/assign

@krzyzacy
Member

The tests that had the wrong --num-nodes are passing now; however, it seems the actual upgrade test is still failing.

/unassign
/assign @ericchiang
feel free to reassign to interested parties.

@k8s-ci-robot k8s-ci-robot assigned ericchiang and unassigned krzyzacy Aug 19, 2017
@ericchiang
Contributor Author

This upgrade test is much healthier since the fix went in and those failures look related to overall test environment issues. Closing. Will open another issue if I identify anything specific about this test.

Thanks @krzyzacy !

@ericchiang
Contributor Author

Reopening since this is still failing on one of our upgrade tests.

https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-gci-master-upgrade-cluster-new

We're blocked on that test not producing logs though (#52578).

@ericchiang ericchiang reopened this Sep 18, 2017
@ericchiang ericchiang added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Sep 18, 2017
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Labels Complete

@ericchiang

Issue label settings:

  • sig/node: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Additional instructions are available here. The commands for adding these labels are documented here.

@dchen1107 dchen1107 added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Sep 18, 2017
@yujuhong
Contributor

couldn't find -1 nodes within 20s; last error: expected to find -1 nodes but found only 3 (20.008632278s elapsed)

The expected number of nodes, "-1", was the value of framework.TestContext.CloudConfig.NumNodes.
This recent PR changed how this value is set in GKE: kubernetes/test-infra#4500
There are other tests with similar failure signature too.
/cc @krzyzacy @zmerlynn
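Since the bad value traces back to an unset NumNodes, a fail-fast validation (a hypothetical sketch, not the actual framework code) would surface the misconfiguration immediately with a clear message instead of a confusing "-1 nodes" timeout later:

```go
package main

import "fmt"

// validateNumNodes rejects the unset/default node count up front, so a
// misconfigured test context fails with an actionable message rather
// than a poll-loop timeout mentioning "-1 nodes".
func validateNumNodes(n int) error {
	if n <= 0 {
		return fmt.Errorf("--num-nodes must be a positive value, got %d", n)
	}
	return nil
}

func main() {
	fmt.Println(validateNumNodes(-1)) // non-nil: catches the unset default
	fmt.Println(validateNumNodes(3))  // <nil>: a real cluster size passes
}
```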

@krzyzacy
Member

krzyzacy commented Sep 18, 2017

It was like that before my PR. @zmerlynn, it seems --num-nodes=3 is passed properly but is still insufficient to fix the issue? I'll take another look.

(Those tests were passing after my PR went in; I think something regressed here.)

@krzyzacy
Member

/assign

@krzyzacy
Member

Before #4500 the entire upgrade was failing (if you scroll to the right of the testgrid page, you can see the UpgradeTest row that it fixed).

So this test was passing in both https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-gci-master-upgrade-cluster and https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-gci-master-upgrade-master. @crimsonfaith91, can you double-check whether there's any difference between 1.7 and 1.8 (it would need to be cherry-picked back to 1.7)?

@krzyzacy
Member

fixed by kubernetes/test-infra#4617
/close

@crimsonfaith91
Contributor

Thanks, @krzyzacy!
