Node controller to not force delete pods #35235

Merged

Conversation

foxish
Contributor

@foxish foxish commented Oct 20, 2016

Fixes #35145

  • Adds e2e tests covering PetSet, RC, and Job.
  • Removes the other locations where the NodeController force-deletes pods and covers them with tests.

Release note:

Node controller no longer force-deletes pods from the api-server.

* For StatefulSet (previously PetSet), this change means creation of replacement pods is blocked until old pods are definitely not running (indicated either by the kubelet returning from partitioned state, or deletion of the Node object, or deletion of the instance in the cloud provider, or force deletion of the pod from the api-server). This has the desirable outcome of "fencing" to prevent "split brain" scenarios; a sketch of the admin's options follows this list.
* For all other existing controllers, this has no effect on their ability to replace pods, because those controllers do not reuse pod names (they use generate-name).
* User-written controllers that reuse pod object names should evaluate this change.
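
A minimal sketch of the admin-side options above, assuming a StatefulSet pod named web-0 (selected by a hypothetical label app=web) that was running on a now-partitioned node named node-1; both names are placeholders:

```sh
# While node-1 is merely NotReady, the StatefulSet does not create a
# replacement for web-0; this can be observed with a watch:
kubectl get pods -l app=web --watch

# Once the admin has confirmed the old instance is truly gone, either of
# these unblocks the replacement:
kubectl delete node node-1                    # remove the Node object, or
kubectl delete pod web-0 --grace-period=0     # force-delete the pod from the api-server
```

Note that newer kubectl releases may additionally require --force alongside --grace-period=0.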

@foxish foxish added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Oct 20, 2016
@foxish foxish added this to the v1.5 milestone Oct 20, 2016
@foxish foxish self-assigned this Oct 20, 2016
@foxish foxish force-pushed the node-controller-no-force-deletion branch from 6b7c2b6 to 2c48694 Compare October 20, 2016 21:37
@foxish foxish added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Oct 20, 2016
@k8s-github-robot k8s-github-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 20, 2016
@AlexMioMio

@foxish Why are we abandoning this feature? I think force-deleting pods may be useful in some situations.

@foxish
Contributor Author

foxish commented Oct 21, 2016

@AlexMioMio, please see the discussion in #35145 and #34160, and feel free to comment there. This is still a WIP.

@smarterclayton
Contributor

We are not abandoning this feature, but we are making it the administrator's responsibility to decide to force delete. Node controller does not know whether the machine is partitioned and coming back or not, so if the node controller deletes those pods it can cause split brain in the cluster for pet sets.

@smarterclayton
Contributor

With this change, if you want to delete a node (because it's no longer running), delete the node and its pods will be cleaned up. If you want to delete a pod, run kubectl delete pod NAME --grace-period=0.
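
In command form (NAME is a placeholder; newer kubectl releases may also require --force alongside --grace-period=0):

```sh
# The machine is permanently gone: deleting the Node object cleans up its pods.
kubectl delete node NAME

# Force-delete a single pod from the api-server without waiting for the kubelet.
kubectl delete pod NAME --grace-period=0
```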

@foxish foxish force-pushed the node-controller-no-force-deletion branch from 2c48694 to 84e1676 Compare October 24, 2016 17:41
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 24, 2016
@foxish foxish closed this Oct 28, 2016
@foxish foxish deleted the node-controller-no-force-deletion branch October 28, 2016 17:02
@foxish foxish restored the node-controller-no-force-deletion branch October 28, 2016 18:53
@foxish foxish reopened this Oct 28, 2016
@foxish foxish force-pushed the node-controller-no-force-deletion branch from 02f5c1c to fa74c29 Compare October 28, 2016 22:11
@foxish foxish changed the title WIP: Node controller to not force delete pods Node controller to not force delete pods Oct 28, 2016
@foxish
Contributor Author

foxish commented Oct 28, 2016

Waiting for #35581 to merge before rebase. The other parts are now ready for review.

@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 28, 2016
@foxish foxish force-pushed the node-controller-no-force-deletion branch from fa74c29 to 92f4f21 Compare October 28, 2016 22:20
@foxish foxish assigned smarterclayton and erictune and unassigned foxish Oct 28, 2016
@foxish foxish force-pushed the node-controller-no-force-deletion branch from e1a15b3 to cf9c787 Compare November 1, 2016 18:34
@foxish foxish force-pushed the node-controller-no-force-deletion branch from cf9c787 to 7194101 Compare November 1, 2016 18:46
@foxish
Contributor Author

foxish commented Nov 1, 2016

Addressed comments by @erictune, will reapply LGTM after tests pass.

@foxish foxish added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2016
@foxish foxish added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Nov 1, 2016
@k8s-github-robot

Automatic merge from submit-queue

@ozbillwang

ozbillwang commented Feb 15, 2017

I understand the request came from #35145 and #34160, but we are seeing side effects from this change.

We run Kubernetes in an AWS autoscaling group, so Kubernetes nodes are constantly being scaled up and down. When terminated nodes stay in NotReady status, a lot of DaemonSet pods end up stuck in Pending status.

Since disabling force deletion of pods/nodes is incompatible with the old behavior, could we have a choice here, so that force deletion can be enabled as an option during setup?

@smarterclayton
Contributor

That violates the safety guarantees of the cluster - if those nodes aren't coming back, you should delete the node. Can you describe exactly why the nodes are in the NotReady state? That seems like a flaw in the autoscaler or the cloud provider - if the instance is deleted in AWS, the cloud provider should remove the instance from the API (the node should be deleted), which should clear the pods off the node, regardless of readiness.

@ozbillwang

ozbillwang commented Feb 15, 2017

Thanks, @smarterclayton

We set up both Kubernetes masters and nodes with AWS autoscaling groups (ASG). The version is 1.4.6.

When a node is scaled down by the ASG in AWS, the instance is terminated, but the node stays in the node list (kubectl get nodes) with NotReady status forever.

My colleague runs another Kubernetes stack on v1.2; after nodes are terminated by the ASG in that environment, they disappear from the node list within a very short time.

I tried to fix this issue but failed. I think the problem may be related to this pull request.

@foxish
Contributor Author

foxish commented Feb 15, 2017

When a node is scaled down by the ASG in AWS, the instance is terminated, but the node stays in the node list (kubectl get nodes) with NotReady status forever.

I think that's the issue here. The nodes should always be kept in sync with the underlying infrastructure. The node object should be removed either by the cloud-provider-specific code or by an external loop; it shouldn't stick around indefinitely as "NotReady".
The rationale for this change is also explained in https://kubernetes.io/docs/admin/node/#condition

In your case, I'd expect the cloud provider specific code to figure out that the instance is gone and remove the node object.
/cc @justinsb for AWS
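
A hedged sketch of how a cluster admin could check for and clean up such stale nodes in the meantime; the node name below is a made-up example:

```sh
# List nodes and spot any stuck in NotReady.
kubectl get nodes

# Inspect the node's spec; its external/provider ID identifies the backing
# cloud instance, which can then be checked on the AWS side.
kubectl get node ip-10-0-1-23.ec2.internal -o yaml

# If the instance is confirmed terminated, remove the stale Node object;
# the pods that were bound to it will then be cleaned up.
kubectl delete node ip-10-0-1-23.ec2.internal
```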

@foxish foxish deleted the node-controller-no-force-deletion branch February 15, 2017 02:56
@smarterclayton
Contributor

smarterclayton commented Feb 15, 2017 via email

@foxish
Contributor Author

foxish commented Feb 15, 2017

I confirmed that the node object does get deleted when the instance is deleted in the case of GCE and GKE. So the fault here is likely in the AWS cloud-provider-specific code.

@foxish
Contributor Author

foxish commented Feb 15, 2017

It is also possible that the instance is being stopped, but not deleted, in this case. If so, the cloud provider code may still find the instance and the NodeController wouldn't delete the node (not sure about this). The responsibility then lies with the cluster admin to clean up such nodes.

@gmarek
Contributor

gmarek commented Feb 15, 2017

Maybe we should delete stopped Nodes as well?

@ozbillwang

ozbillwang commented Feb 15, 2017

Thanks for the comments above.

I am still not sure whether this is related to my environment only. This PR was merged into version 1.5, but my Kubernetes version is only 1.4.6, so force delete should still be enabled.

@gmarek
Contributor

gmarek commented Feb 15, 2017

The only thing that was changed here was the handling of Pods on not-ready Nodes. The code that is supposed to remove Nodes when they're no longer present in the cloud provider was untouched (i.e. a Node should disappear when the corresponding VM is deleted and the cloud provider starts answering "NotFound" when asked about it). We also moved the code that was responsible for deleting Pods from non-existent Nodes, but that should have been a no-op.
