
[upgrade test failure] Restart [Disruptive] should restart all nodes and ensure all nodes and pods recover #50797

Closed
ericchiang opened this issue Aug 16, 2017 · 13 comments
Assignees
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
  • sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
Milestone

Comments

@ericchiang
Contributor

Opening this since this seems slightly different from #46651

/cc @kubernetes/sig-node-bugs

This test has been consistently failing on a lot of the upgrade jobs:

https://k8s-testgrid.appspot.com/master-upgrade#gke-cvm-1.7-gci-master-upgrade-master
https://k8s-testgrid.appspot.com/master-upgrade#gke-cvm-1.7-gci-master-upgrade-cluster
https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-cvm-master-upgrade-master
https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-cvm-master-upgrade-cluster
https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-gci-master-upgrade-master
https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-gci-master-upgrade-cluster

What's really weird is the error message most of them spit out:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/restart.go:92
Expected error:
    <*errors.errorString | 0xc422322770>: {
        s: "couldn't find -1 nodes within 20s; last error: expected to find -1 nodes but found only 3 (20.007524987s elapsed)",
    }
    couldn't find -1 nodes within 20s; last error: expected to find -1 nodes but found only 3 (20.007524987s elapsed)
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/restart.go:77

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-cvm-new-cvm-master-upgrade-cluster/128#k8sio-restart-disruptive-should-restart-all-nodes-and-ensure-all-nodes-and-pods-recover

It's hard to tell who owns this test; I'm going to tag sig-node until there's further evidence otherwise.

cc @kubernetes/kubernetes-release-managers @mbohlool

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/bug Categorizes issue or PR as related to a bug. labels Aug 16, 2017
@ericchiang ericchiang added this to the v1.8 milestone Aug 16, 2017
@wojtek-t
Member

kubernetes/test-infra#4086 (comment) should fix this

@wojtek-t
Member

@krzyzacy - FYI

@krzyzacy
Member

/assign

@krzyzacy
Member

The tests that had the wrong --num-nodes are passing now; however, it seems the actual upgrade test is still failing.

/unassign
/assign @ericchiang
feel free to reassign to interested parties.

@k8s-ci-robot k8s-ci-robot assigned ericchiang and unassigned krzyzacy Aug 19, 2017
@ericchiang
Contributor Author

This upgrade test is much healthier since the fix went in and those failures look related to overall test environment issues. Closing. Will open another issue if I identify anything specific about this test.

Thanks @krzyzacy !

@ericchiang
Contributor Author

Reopening since this is still failing on one of our upgrade tests.

https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-gci-master-upgrade-cluster-new

We're blocked on that test not producing logs though (#52578).

@ericchiang ericchiang reopened this Sep 18, 2017
@ericchiang ericchiang added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Sep 18, 2017
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Labels Complete

@ericchiang

Issue label settings:

  • sig/node: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Additional instructions are available here. The commands for adding these labels are documented here.

@dchen1107 dchen1107 added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Sep 18, 2017
@yujuhong
Contributor

couldn't find -1 nodes within 20s; last error: expected to find -1 nodes but found only 3 (20.008632278s elapsed)

The expected number of nodes, "-1", was the value of framework.TestContext.CloudConfig.NumNodes.
This recent PR changed how this value is set in GKE: kubernetes/test-infra#4500
There are other tests with similar failure signature too.
/cc @krzyzacy @zmerlynn
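Since the bad value traces back to an unset NumNodes, a fail-fast validation (a hypothetical sketch, not the actual framework code) would surface the misconfiguration immediately with a clear message instead of a confusing "-1 nodes" timeout later:

```go
package main

import "fmt"

// validateNumNodes rejects the unset/default node count up front, so a
// misconfigured test context fails with an actionable message rather
// than a poll-loop timeout mentioning "-1 nodes".
func validateNumNodes(n int) error {
	if n <= 0 {
		return fmt.Errorf("--num-nodes must be a positive value, got %d", n)
	}
	return nil
}

func main() {
	fmt.Println(validateNumNodes(-1)) // non-nil: catches the unset default
	fmt.Println(validateNumNodes(3))  // <nil>: a real cluster size passes
}
```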

@krzyzacy
Member

krzyzacy commented Sep 18, 2017

It was like that before my PR. @zmerlynn, it seems --num-nodes=3 is passed properly but is still insufficient to fix the issue? I'll take another look.

(Those tests were passing after my PR went in; I think something regressed here.)

@krzyzacy
Member

/assign

@krzyzacy
Member

Before #4500 the entire upgrade was failing (if you scroll to the right of the testgrid page, you can see the UpgradeTest row that it fixed).

So this test was passing in both https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-gci-master-upgrade-cluster and https://k8s-testgrid.appspot.com/master-upgrade#gke-gci-1.7-gci-master-upgrade-master. @crimsonfaith91, can you double-check whether there's any difference between 1.7 and 1.8 (it would need to be cherry-picked back to 1.7)?

@krzyzacy
Member

fixed by kubernetes/test-infra#4617
/close

@crimsonfaith91
Contributor

Thanks, @krzyzacy!
