
Nodes are not removed after deleting VMs. #72499

Closed
krzysztof-jastrzebski opened this issue Jan 2, 2019 · 24 comments
Assignees
Labels
kind/bug: Categorizes issue or PR as related to a bug.
sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@krzysztof-jastrzebski
Contributor

What happened:
Nodes are not removed after deleting VMs.

What you expected to happen:
Nodes should be deleted.

How to reproduce it (as minimally and precisely as possible):
Create a cluster with 5 nodes using a HEAD build. Delete 4 of the VMs, then list the nodes: 4 will be NotReady and 1 Ready. I checked that the NotReady nodes were still not removed after 10 minutes.
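A minimal client-go sketch (assuming the default kubeconfig and a reasonably recent client-go) that lists each node's Ready condition, which makes the stuck nodes easy to spot after the VMs are gone:

```go
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config), i.e. the same credentials kubectl uses.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List every node and print its Ready condition; the nodes whose VMs were
	// deleted remain listed here and are no longer Ready=True.
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == v1.NodeReady {
				fmt.Printf("%s\tReady=%s (%s)\n", node.Name, cond.Status, cond.Reason)
			}
		}
	}
}
```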

Anything else we need to know?:
The bug might be caused by #70344.

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:04:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0-alpha.0.1352+a7cb03f4cfbf3b", GitCommit:"a7cb03f4cfbf3b519dc1a0090331a475abbe0321", GitTreeState:"clean", BuildDate:"2019-01-02T19:29:04Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GKE

/kind bug

@k8s-ci-robot added the kind/bug and needs-sig labels on Jan 2, 2019.
@krzysztof-jastrzebski
Contributor Author

/assign andrewsykim
/assign mtaufen
/sig node

@k8s-ci-robot added the sig/node label and removed the needs-sig label on Jan 2, 2019.
@mtaufen
Contributor

mtaufen commented Jan 2, 2019

It could be that we need to update GKE's configuration for 1.14. I'll take a look.

@mtaufen
Contributor

mtaufen commented Jan 2, 2019

As far as I can tell, GKE still sets --cloud-provider=gce on kube-controller-manager, which means LoopMode should still be set to IncludeCloudLoops and the cloud-specific controller @andrewsykim added in #70344 should be running. So my first guess (that it was simply turned off) doesn't appear to be correct.
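As a simplified, illustrative sketch of that gating (the constant and function names below are approximations, not the actual kube-controller-manager symbols): an in-tree --cloud-provider value such as gce keeps the cloud-dependent loops in-process, while --cloud-provider=external hands them off to a cloud-controller-manager.

```go
// Illustrative only: a simplified model of how cloud-dependent loops (such as
// the cloud node lifecycle controller) are gated. Names are approximations.
package loopmode

type controllerLoopMode int

const (
	includeCloudLoops controllerLoopMode = iota // in-tree provider (e.g. gce): run cloud loops in kube-controller-manager
	externalLoops                               // --cloud-provider=external: cloud-controller-manager runs them
)

func loopModeFor(cloudProvider string) controllerLoopMode {
	if cloudProvider == "external" {
		return externalLoops
	}
	return includeCloudLoops
}

// shouldRunCloudNodeLifecycle is true when the cloud loops run in-process and
// a cloud provider is actually configured.
func shouldRunCloudNodeLifecycle(cloudProvider string) bool {
	return cloudProvider != "" && loopModeFor(cloudProvider) == includeCloudLoops
}
```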

@krzysztof-jastrzebski can you check whether your controller-manager logs contain any messages like "failed to start cloud node lifecycle controller"?

@andrewsykim
Member

andrewsykim commented Jan 3, 2019

I haven't had a chance to test the changes I merged end-to-end yet, since I was taking some time off for the holidays, but I will check whether this is reproducible on other cloud providers to gather more data. Thanks for reporting, @krzysztof-jastrzebski!

@krzysztof-jastrzebski
Contributor Author

I checked the logs and I don't see any error containing the string "lifecycle". The flag is set to --cloud-provider=gce.

@andrewsykim
Member

I was able to reproduce this with an out-of-tree provider as well. Will dig further and report back.

@andrewsykim
Member

@mtaufen it looks like when the underlying VMs are deleted, the status of the Ready condition is actually Unknown and not False:

  - lastHeartbeatTime: 2019-01-04T06:04:03Z
    lastTransitionTime: 2019-01-04T06:04:46Z
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready

In #70344 we added a check that skips deleting nodes whose Ready condition is Unknown, whereas the previous controller logic acted on any node with NodeReady != True.
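A rough sketch of that difference using the public core/v1 types (this is not the actual node lifecycle controller code):

```go
// Rough sketch only, not the actual controller code.
package lifecyclesketch

import v1 "k8s.io/api/core/v1"

// readyCondition returns the node's Ready condition, or nil if it is not set.
func readyCondition(node *v1.Node) *v1.NodeCondition {
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == v1.NodeReady {
			return &node.Status.Conditions[i]
		}
	}
	return nil
}

// Pre-#70344 behavior: anything other than Ready=True (so False *and* Unknown)
// makes the node a candidate for the "does the VM still exist?" check.
func candidateForCloudCheckOld(node *v1.Node) bool {
	c := readyCondition(node)
	return c == nil || c.Status != v1.ConditionTrue
}

// #70344 behavior: Unknown is skipped, so nodes whose kubelet simply stopped
// posting status (the deleted-VM case above) are never checked or removed.
func candidateForCloudCheckNew(node *v1.Node) bool {
	c := readyCondition(node)
	return c != nil && c.Status == v1.ConditionFalse
}
```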

@andrewsykim
Member

andrewsykim commented Jan 4, 2019

@krzysztof-jastrzebski can you confirm if the Ready condition on your node is also Unknown?

@andrewsykim
Member

Opened #72559 (validated with the CCM on master), which would restore the node deletion logic to what we had prior to #70344. I'm not sure whether expecting the Ready condition to be Unknown is the correct behavior here, but the PR is there if that's what we decide is best for now. Will defer to your judgement @mtaufen.

@krzysztof-jastrzebski
Contributor Author

@andrewsykim I confirm Ready condition is Unknown.

@andrewsykim
Member

@krzysztof-jastrzebski my PR has merged to master; can you please test on the latest master when you have a chance?

@krzysztof-jastrzebski
Contributor Author

@andrewsykim It still doesn't work.

@andrewsykim
Member

andrewsykim commented Jan 7, 2019

@krzysztof-jastrzebski what's the server version?

@krzysztof-jastrzebski
Contributor Author

I'm using version 1.14.0-alpha.0.1475+fdf381098bd3e8-kjastrzebski-07-01-19-2, built from HEAD today (fdf3810).

@andrewsykim
Member

Are you able to confirm whether the kube-controller-manager version is the same? (Sorry, I'm not super familiar with how GKE is set up.)

@krzysztof-jastrzebski
Contributor Author

Yes, the version is the same. You can download my build from:
gsutil ls gs://kubernetes-release-gke-internal/devel/v1.14.0-alpha.0.1475+fdf381098bd3e8-kjastrzebski-07-01-19-2

@andrewsykim
Member

andrewsykim commented Jan 7, 2019

Thanks @krzysztof-jastrzebski. I tested this version on an out-of-tree cloud provider and it works as expected (they run the same controller). I'll try to get a GKE environment set up to debug this further (it might take a few days). If you have any new logs (specifically from node_lifecycle_controller.go) from after you updated to the latest version, that would also be super helpful.

@krzysztof-jastrzebski
Contributor Author

I checked the logs; the controller now tries to delete the node, but it fails with an error:
node_lifecycle_controller.go:194] unable to delete node "gke-cluster-5-default-pool-49d43f11-9lr6": nodes "gke-cluster-5-default-pool-49d43f11-9lr6" is forbidden: User "system:serviceaccount:kube-system:cloud-node-lifecycle-controller" cannot delete resource "nodes" in API group "" at the cluster scope

@andrewsykim
Member

andrewsykim commented Jan 9, 2019

This makes sense, because we didn't update the bootstrap RBAC rules for the cloud-node-lifecycle-controller service account. I'll have a PR for this soon. Thank you @krzysztof-jastrzebski!
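For illustration, a hedged sketch of the missing permission: the kube-system cloud-node-lifecycle-controller service account needs a cluster-scoped grant to delete nodes. The role and binding names and the exact verb list below are assumptions, not the actual bootstrap policy from #72764.

```go
// Hedged sketch only: the names and verb list are assumptions for illustration,
// not the actual bootstrap RBAC policy.
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func cloudNodeLifecycleRBAC() (*rbacv1.ClusterRole, *rbacv1.ClusterRoleBinding) {
	const name = "system:controller:cloud-node-lifecycle-controller" // assumed name

	role := &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""}, // core API group, where nodes live
			Resources: []string{"nodes"},
			Verbs:     []string{"get", "list", "delete"}, // delete is the verb the error above is missing
		}},
	}
	binding := &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "ClusterRole",
			Name:     name,
		},
		Subjects: []rbacv1.Subject{{
			Kind:      rbacv1.ServiceAccountKind,
			Namespace: "kube-system",
			Name:      "cloud-node-lifecycle-controller",
		}},
	}
	return role, binding
}

func main() {
	role, binding := cloudNodeLifecycleRBAC()
	fmt.Println("would create ClusterRole", role.Name, "and ClusterRoleBinding", binding.Name)
}
```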

@andrewsykim
Member

Should be fixed in #72764.

@andrewsykim
Member

@krzysztof-jastrzebski #72764 merged, are you able to test it one more time please? :)

@krzysztof-jastrzebski
Contributor Author

Now it works.

@andrewsykim
Member

Thank you for testing @krzysztof-jastrzebski!

/close

@k8s-ci-robot
Contributor

@andrewsykim: Closing this issue.

In response to this:

Thank you for testing @krzysztof-jastrzebski!

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
