
Nodes are not removed after deleting VMs. #72499

Closed
krzysztof-jastrzebski opened this issue Jan 2, 2019 · 24 comments
Assignees
Labels
kind/bug: Categorizes issue or PR as related to a bug.
sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@krzysztof-jastrzebski
Contributor

What happened:
Nodes are not removed after deleting VMs.

What you expected to happen:
Nodes should be deleted.

How to reproduce it (as minimally and precisely as possible):
Create a cluster with 5 nodes using a HEAD build. Delete 4 of the VMs, then list the nodes: 4 will be NotReady and 1 Ready. I checked that the NotReady nodes were still not removed after 10 minutes.
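A minimal client-go sketch (assuming the default kubeconfig and a reasonably recent client-go) that lists each node's Ready condition, which makes the stuck nodes easy to spot after the VMs are gone:

```go
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config), i.e. the same credentials kubectl uses.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List every node and print its Ready condition; the nodes whose VMs were
	// deleted remain listed here and are no longer Ready=True.
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == v1.NodeReady {
				fmt.Printf("%s\tReady=%s (%s)\n", node.Name, cond.Status, cond.Reason)
			}
		}
	}
}
```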

Anything else we need to know?:
The bug might be caused by #70344.

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:04:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0-alpha.0.1352+a7cb03f4cfbf3b", GitCommit:"a7cb03f4cfbf3b519dc1a0090331a475abbe0321", GitTreeState:"clean", BuildDate:"2019-01-02T19:29:04Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GKE

/kind bug

@k8s-ci-robot added the kind/bug and needs-sig labels on Jan 2, 2019.
@krzysztof-jastrzebski
Contributor Author

/assign andrewsykim
/assign mtaufen
/sig node

@k8s-ci-robot added the sig/node label and removed the needs-sig label on Jan 2, 2019.
@mtaufen
Contributor

mtaufen commented Jan 2, 2019

It could be that we need to update GKE's configuration for 1.14. I'll take a look.

@mtaufen
Contributor

mtaufen commented Jan 2, 2019

As far as I can tell, GKE still sets --cloud-provider=gce on kube-controller-manager, which means LoopMode should still be set to IncludeCloudLoops and the cloud-specific controller @andrewsykim added in #70344 should be running. So my first guess (that it was simply turned off) doesn't appear to be correct.
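As a simplified, illustrative sketch of that gating (the constant and function names below are approximations, not the actual kube-controller-manager symbols): an in-tree --cloud-provider value such as gce keeps the cloud-dependent loops in-process, while --cloud-provider=external hands them off to a cloud-controller-manager.

```go
// Illustrative only: a simplified model of how cloud-dependent loops (such as
// the cloud node lifecycle controller) are gated. Names are approximations.
package loopmode

type controllerLoopMode int

const (
	includeCloudLoops controllerLoopMode = iota // in-tree provider (e.g. gce): run cloud loops in kube-controller-manager
	externalLoops                               // --cloud-provider=external: cloud-controller-manager runs them
)

func loopModeFor(cloudProvider string) controllerLoopMode {
	if cloudProvider == "external" {
		return externalLoops
	}
	return includeCloudLoops
}

// shouldRunCloudNodeLifecycle is true when the cloud loops run in-process and
// a cloud provider is actually configured.
func shouldRunCloudNodeLifecycle(cloudProvider string) bool {
	return cloudProvider != "" && loopModeFor(cloudProvider) == includeCloudLoops
}
```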

@krzysztof-jastrzebski can you check whether your controller-manager logs contain any messages like "failed to start cloud node lifecycle controller"?

@andrewsykim
Member

andrewsykim commented Jan 3, 2019

I haven't had a chance to test the changes I merged end-to-end yet, since I was taking some time off for the holidays, but I will check whether this is reproducible on other cloud providers to gather more data. Thanks for reporting, @krzysztof-jastrzebski!

@krzysztof-jastrzebski
Contributor Author

I checked the logs and I don't see any error containing the string "lifecycle". The flag is set to --cloud-provider=gce.

@andrewsykim
Member

I was able to reproduce this with an out-of-tree provider as well. Will dig further and report back.

@andrewsykim
Member

@mtaufen it looks like when the underlying VMs are deleted, the status of the Ready condition is actually Unknown and not False:

  - lastHeartbeatTime: 2019-01-04T06:04:03Z
    lastTransitionTime: 2019-01-04T06:04:46Z
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready

In #70344 we added a check that skips deleting nodes whose Ready condition is Unknown, whereas the previous controller logic acted on any node with NodeReady != True.
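A rough sketch of that difference using the public core/v1 types (this is not the actual node lifecycle controller code):

```go
// Rough sketch only, not the actual controller code.
package lifecyclesketch

import v1 "k8s.io/api/core/v1"

// readyCondition returns the node's Ready condition, or nil if it is not set.
func readyCondition(node *v1.Node) *v1.NodeCondition {
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == v1.NodeReady {
			return &node.Status.Conditions[i]
		}
	}
	return nil
}

// Pre-#70344 behavior: anything other than Ready=True (so False *and* Unknown)
// makes the node a candidate for the "does the VM still exist?" check.
func candidateForCloudCheckOld(node *v1.Node) bool {
	c := readyCondition(node)
	return c == nil || c.Status != v1.ConditionTrue
}

// #70344 behavior: Unknown is skipped, so nodes whose kubelet simply stopped
// posting status (the deleted-VM case above) are never checked or removed.
func candidateForCloudCheckNew(node *v1.Node) bool {
	c := readyCondition(node)
	return c != nil && c.Status == v1.ConditionFalse
}
```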

@andrewsykim
Member

andrewsykim commented Jan 4, 2019

@krzysztof-jastrzebski can you confirm if the Ready condition on your node is also Unknown?

@andrewsykim
Member

Opened #72559 (validated with the CCM on master), which would restore the node deletion logic to what we had prior to #70344. I'm not sure whether expecting the Ready condition to be Unknown is the correct behavior here, but the PR is there if that's what we decide is best for now. Will defer to your judgement @mtaufen.

@krzysztof-jastrzebski
Contributor Author

@andrewsykim I confirm Ready condition is Unknown.

@andrewsykim
Member

@krzysztof-jastrzebski my PR has merged to master; can you please test on the latest master when you have a chance?

@krzysztof-jastrzebski
Contributor Author

@andrewsykim It still doesn't work.

@andrewsykim
Member

andrewsykim commented Jan 7, 2019

@krzysztof-jastrzebski what's the server version?

@krzysztof-jastrzebski
Contributor Author

I'm using version 1.14.0-alpha.0.1475+fdf381098bd3e8-kjastrzebski-07-01-19-2, built from HEAD today (fdf3810).

@andrewsykim
Member

Are you able to confirm whether the kube-controller-manager version is the same? (Sorry, I'm not super familiar with how GKE is set up.)

@krzysztof-jastrzebski
Contributor Author

Yes, the version is the same. You can download my build from:
gsutil ls gs://kubernetes-release-gke-internal/devel/v1.14.0-alpha.0.1475+fdf381098bd3e8-kjastrzebski-07-01-19-2

@andrewsykim
Member

andrewsykim commented Jan 7, 2019

Thanks @krzysztof-jastrzebski. I tested this version on an out-of-tree cloud provider and it works as expected (they run the same controller). I'll try to get a GKE environment set up to debug this further (it might take a few days). If you have any new logs (specifically from node_lifecycle_controller.go) from after you updated to the latest version, that would also be super helpful.

@krzysztof-jastrzebski
Contributor Author

I checked the logs; the controller now tries to delete the node, but it fails with an error:
node_lifecycle_controller.go:194] unable to delete node "gke-cluster-5-default-pool-49d43f11-9lr6": nodes "gke-cluster-5-default-pool-49d43f11-9lr6" is forbidden: User "system:serviceaccount:kube-system:cloud-node-lifecycle-controller" cannot delete resource "nodes" in API group "" at the cluster scope

@andrewsykim
Member

andrewsykim commented Jan 9, 2019

This makes sense, because we didn't update the bootstrap RBAC rules for the cloud-node-lifecycle-controller service account. I'll have a PR for this soon. Thank you @krzysztof-jastrzebski!
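For illustration, a hedged sketch of the missing permission: the kube-system cloud-node-lifecycle-controller service account needs a cluster-scoped grant to delete nodes. The role and binding names and the exact verb list below are assumptions, not the actual bootstrap policy from #72764.

```go
// Hedged sketch only: the names and verb list are assumptions for illustration,
// not the actual bootstrap RBAC policy.
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func cloudNodeLifecycleRBAC() (*rbacv1.ClusterRole, *rbacv1.ClusterRoleBinding) {
	const name = "system:controller:cloud-node-lifecycle-controller" // assumed name

	role := &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""}, // core API group, where nodes live
			Resources: []string{"nodes"},
			Verbs:     []string{"get", "list", "delete"}, // delete is the verb the error above is missing
		}},
	}
	binding := &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "ClusterRole",
			Name:     name,
		},
		Subjects: []rbacv1.Subject{{
			Kind:      rbacv1.ServiceAccountKind,
			Namespace: "kube-system",
			Name:      "cloud-node-lifecycle-controller",
		}},
	}
	return role, binding
}

func main() {
	role, binding := cloudNodeLifecycleRBAC()
	fmt.Println("would create ClusterRole", role.Name, "and ClusterRoleBinding", binding.Name)
}
```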

@andrewsykim
Member

Should be fixed in #72764.

@andrewsykim
Member

@krzysztof-jastrzebski #72764 merged, are you able to test it one more time please? :)

@krzysztof-jastrzebski
Contributor Author

Now it works.

@andrewsykim
Member

Thank you for testing @krzysztof-jastrzebski!

/close

@k8s-ci-robot
Contributor

@andrewsykim: Closing this issue.

In response to this:

Thank you for testing @krzysztof-jastrzebski!

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
