
kubectl drain leads to downtime even with a PodDisruptionBudget #48307

Closed
gg7 opened this issue Jun 30, 2017 · 26 comments
Labels
area/admin, area/app-lifecycle, area/node-lifecycle, kind/bug, lifecycle/rotten, sig/apps

Comments

@gg7

gg7 commented Jun 30, 2017

/kind bug

What happened:

I ran a demo application:

kubectl run my-nginx --image=nginx --port 80 --expose

Then I defined a PodDisruptionBudget:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-nginx
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-nginx

Then I executed kubectl drain --force --ignore-daemonsets --delete-local-data NODE-1. (I used --force because NODE-1 is a master node)

I monitored the pods with while true; do date; kubectl get pods -o wide; sleep 1; done

Output:

Thu Jun 29 23:36:58 BST 2017
NAME                       READY     STATUS    RESTARTS   AGE       IP             NODE
my-nginx-858393261-pxgh3   1/1       Running   0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:36:59 BST 2017
NAME                       READY     STATUS    RESTARTS   AGE       IP             NODE
my-nginx-858393261-pxgh3   1/1       Running   0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:00 BST 2017
NAME                       READY     STATUS    RESTARTS   AGE       IP             NODE
my-nginx-858393261-pxgh3   1/1       Running   0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:01 BST 2017
NAME                       READY     STATUS    RESTARTS   AGE       IP             NODE
my-nginx-858393261-pxgh3   1/1       Running   0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:03 BST 2017
NAME                       READY     STATUS    RESTARTS   AGE       IP             NODE
my-nginx-858393261-pxgh3   1/1       Running   0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:04 BST 2017
NAME                       READY     STATUS        RESTARTS   AGE       IP             NODE
my-nginx-858393261-pxgh3   1/1       Terminating   0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:05 BST 2017
NAME                       READY     STATUS              RESTARTS   AGE       IP             NODE
my-nginx-858393261-pdr90   0/1       ContainerCreating   0          1s        <none>         NODE-2
my-nginx-858393261-pxgh3   1/1       Terminating         0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:06 BST 2017
NAME                       READY     STATUS              RESTARTS   AGE       IP             NODE
my-nginx-858393261-pdr90   0/1       ContainerCreating   0          2s        <none>         NODE-2
my-nginx-858393261-pxgh3   1/1       Terminating         0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:07 BST 2017
NAME                       READY     STATUS              RESTARTS   AGE       IP             NODE
my-nginx-858393261-pdr90   0/1       ContainerCreating   0          4s        <none>         NODE-2
my-nginx-858393261-pxgh3   1/1       Terminating         0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:09 BST 2017
NAME                       READY     STATUS              RESTARTS   AGE       IP             NODE
my-nginx-858393261-pdr90   0/1       ContainerCreating   0          5s        <none>         NODE-2
my-nginx-858393261-pxgh3   1/1       Terminating         0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:10 BST 2017
NAME                       READY     STATUS              RESTARTS   AGE       IP             NODE
my-nginx-858393261-pdr90   0/1       ContainerCreating   0          6s        <none>         NODE-2
my-nginx-858393261-pxgh3   1/1       Terminating         0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:11 BST 2017
NAME                       READY     STATUS        RESTARTS   AGE       IP             NODE
my-nginx-858393261-pdr90   1/1       Running       0          7s        10.y.y.y       NODE-2
my-nginx-858393261-pxgh3   1/1       Terminating   0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:12 BST 2017
NAME                       READY     STATUS        RESTARTS   AGE       IP             NODE
my-nginx-858393261-pdr90   1/1       Running       0          8s        10.y.y.y       NODE-2
my-nginx-858393261-pxgh3   1/1       Terminating   0          16d       10.x.x.x       NODE-1
Thu Jun 29 23:37:13 BST 2017
NAME                       READY     STATUS    RESTARTS   AGE       IP             NODE
my-nginx-858393261-pdr90   1/1       Running   0          9s        10.y.y.y       NODE-2
Thu Jun 29 23:37:14 BST 2017
NAME                       READY     STATUS    RESTARTS   AGE       IP             NODE
my-nginx-858393261-pdr90   1/1       Running   0          12s       10.y.y.y       NODE-2
Thu Jun 29 23:37:17 BST 2017

What you expected to happen:

I expected Kubernetes to

  1. Schedule a new my-nginx pod on another node
  2. Wait for it to become ready
  3. In parallel (see Lost requests when doing a rolling update #43576):
    3a. Update the service to send traffic to the new pod
    3b. Terminate the pod on the node that's being drained

I think the PodDisruptionBudget didn't have any effect. I ran another test without it and I ended up with a single, unready pod. I believe that happened because pulling the nginx image took longer on the third machine.

Anyway, it should be possible for me to drain master/worker nodes without downtime without

  • having to run 2+ replicas of everything with the correct anti-affinity annotations, or
  • having to specify any PDBs (Kubernetes should assume minAvailable: 1)

If the issue stems from using --force, then administrators need a better way of draining master nodes.

How to reproduce it (as minimally and precisely as possible):

See above; it's three commands.

Environment:

  • Kubernetes version: v1.6.5 (client and nodes)
  • Cloud provider or hardware configuration: Bare-metal servers
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): 4.4.0-*
  • Install tools: custom, from scratch installation
@k8s-ci-robot added the kind/bug label Jun 30, 2017
@k8s-github-robot

@gg7 There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
e.g., @kubernetes/sig-api-machinery-* for API Machinery
(2) specifying the label manually: /sig <label>
e.g., /sig scalability for sig/scalability

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here

@k8s-github-robot added the needs-sig label Jun 30, 2017
@gg7
Author

gg7 commented Jun 30, 2017

@kubernetes/sig-cluster-ops-*
@kubernetes/sig-node-*

/area node-lifecycle
/area app-lifecycle
/area admin

@k8s-ci-robot added the area/node-lifecycle, area/app-lifecycle, and area/admin labels Jun 30, 2017
@foxish
Contributor

foxish commented Jun 30, 2017

@kubernetes/sig-apps-bugs

@k8s-ci-robot added the sig/apps and kind/bug labels Jun 30, 2017
@k8s-github-robot removed the needs-sig label Jun 30, 2017
@foxish
Contributor

foxish commented Jun 30, 2017

If you launched your deployment using kubectl run my-nginx --image=nginx --port 80 --expose as you mentioned, the pod ends up with the label run: my-nginx, not app: my-nginx as used in your PDB selector (see the pod spec with kubectl get pod <name of pod> -o yaml).

The PDB selector needs to match the pod's label(s) for it to take effect.
One way to verify that your PDB is working as intended is to check its status with kubectl get pdb my-nginx -o yaml, which should look like the following.

  status:
    currentHealthy: 1
    desiredHealthy: 1
    disruptedPods: null
    disruptionsAllowed: 0
    expectedPods: 1
    observedGeneration: 1
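
For comparison, a PDB selector that matches the label kubectl run applies would look roughly like this (a sketch only; double-check the actual labels on your pods first):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-nginx
spec:
  minAvailable: 1
  selector:
    matchLabels:
      run: my-nginx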

@0xmichalis
Contributor

Anyway, it should be possible for me to drain master/worker nodes without downtime without having to run 2+ replicas of everything with the correct anti-affinity annotations

Single-pod deployments are by definition not HA - k8s can do nothing today about them. Not sure if this is a bug or a feature request.

@gg7
Author

gg7 commented Jun 30, 2017

@foxish Good point, thanks!

I've changed the PDB:

kubectl get pdb -o yaml my-nginx
[...]
status:
  currentHealthy: 1
  desiredHealthy: 1
  disruptedPods: null
  disruptionsAllowed: 0
  expectedPods: 1
  observedGeneration: 1

I also added spec.minReadySeconds = 10 to the my-nginx deployment.
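
One way to set that field, for anyone reproducing this, is a merge patch along these lines (editing the Deployment manifest directly works just as well):

kubectl patch deployment my-nginx --type merge -p '{"spec":{"minReadySeconds":10}}'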

Now kubectl drain gets stuck:

george@george:~$ kubectl drain --grace-period=60 --force --ignore-daemonsets --delete-local-data s12-4
node "s12-4" cordoned
WARNING: Ignoring DaemonSet-managed pods: nginx-ingress-lb-jzn4t, fluentd-elasticsearch-5cqr8, [...]

This was executed 5+ minutes ago and there's still a single my-nginx pod running on the cordoned node.

@gg7
Author

gg7 commented Jun 30, 2017

@Kargakis

Anyway, it should be possible for me to drain master/worker nodes without downtime without having to run 2+ replicas of everything with the correct anti-affinity annotations

Single-pod deployments are by definition not HA - k8s can do nothing today about them. Not sure if this is a bug or a feature request.

Kubernetes knows how to deploy single-pod applications with no downtime using a rolling update. I'm not expecting HA in the case of a server crash, but I expect kubectl drain to be more intelligent. You can consider this a feature request, but I'm surprised it hasn't been implemented already. Should I open a separate issue for that?

@0xmichalis
Contributor

Now kubectl drain gets stuck:

PDB works! :)

Kubernetes knows how to deploy single-pod applications with no downtime with a rolling update. I'm not expecting HA in case of a server crash, but I expect kubectl drain to be more intelligent. You can consider this a feature request, but I'm surprised that this hasn't been implemented already. Should I open a separate issue for that?

I don't think there is any bug here, so we can use this issue as a feature request. I could see having a way to signal deployments to run surge pods and then have PDBs use the new API, but I would like to read more thoughts on this.

@gg7
Author

gg7 commented Jun 30, 2017

This was executed 5+ minutes ago and there's still a single my-nginx pod running on the cordoned node.

If I pass --timeout=90s to kubectl drain, it fails explicitly:

There are pending pods when an error occurred: Drain did not complete within 1m30s
pod/my-nginx-858393261-vjcgz
error: Drain did not complete within 1m30s

@gjcarneiro

Single-pod deployments are by definition not HA - k8s can do nothing today about them. Not sure if this is a bug or a feature request.

One thing is high availability in the case of a hardware problem. Yes, in that case, if the machine dies you get downtime when you only have one replica. Server hardware failures are rare enough that we can live with a couple of minutes of downtime when they happen.

Another thing is planned maintenance on a node: it should be fairly simple to make sure extra pods are started elsewhere before we shut down this node for maintenance. I mean, it's not rocket science, is it...

@0xmichalis
Contributor

Another thing is a planned maintenance on a node: it should be fairly simple to make sure extra pods are started elsewhere before we shut down this node for maintenance. I mean, it's not rocket science, is it...

No assumptions are made regarding availability of single-pod controllers with the API today. What if the extra pod that you want to surge by default violates the quota given to a user? Are we going to block a cluster upgrade in such a scenario? It's unlikely that most admins would like that. Opening a proposal in the community repo of an open source project is not rocket science either :)

@gjcarneiro

Yes, I realise I was a bit condescending in my last remark. Apologies for that.

What if the extra pod that you want to surge by default violates the quota given to a user? Are we going to block a cluster upgrade in such a scenario?

Well, I would say: at least attempt to create the extra pod. If it fails, it fails; give up after some time. But at least we tried, and it will probably work in most cases.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 31, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 30, 2018
@kow3ns
Member

kow3ns commented Feb 27, 2018

/close

@AshMartian

Googler stumbling across this with the same question as the OP, and I can't find a workaround.

Is there a way to accomplish this? I have a scenario where, due to RAM constraints, dev/test environments can't afford to be fully HA; each service would be fine with a PDB of minAvailable: 1, while the deployment and horizontal autoscaler are set to 1 replica. Ideally kubectl drain should be able to scale the deployment up, wait until the new pod is ready, then delete the old pod. These pods take up to 4 minutes to restart, so the PDB is needed.
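
A rough manual approximation of that sequence, assuming a Deployment called my-app on a node called worker-1 with a minAvailable: 1 PDB (all names here are placeholders):

kubectl cordon worker-1                        # keep replacement pods off this node
kubectl scale deployment/my-app --replicas=2   # surge a second pod onto another node
kubectl wait --for=condition=Available deployment/my-app --timeout=10m
# Note: an HPA pinned to 1 replica may scale this back down; raise its minReplicas first if needed.

# With 2 healthy pods and minAvailable: 1, the PDB now allows one eviction.
kubectl drain worker-1 --ignore-daemonsets

kubectl scale deployment/my-app --replicas=1   # back to a single replica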

@juliohm1978

We are using a custom bash script that implements an alternative to kubectl drain.

https://gist.github.com/juliohm1978/1f24f9259399e1e1edf092f1e2c7b089

Not a perfect solution, but it really helps when most of your deployments are single Pods with a rollout strategy like the following:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

@bjorn-ali-goransson

Single-pod deployments are by definition not HA

I assume this means High Availability. I'm a bit of a noob in this terminology.

Wouldn't it be nice to have a kubectl drain --safe where you don't shut down a pod without first having started (and readiness-probed) it on another node?

Perhaps this is one of those "this feature contradicts in every way the architecture of the system" requests, like my favorite application feature request "making an 'impersonate user' feature can't be that hard, can it?" ...

@alex88

alex88 commented Jul 22, 2020

So is it expected that deleting a pod or draining a node causes downtime even if you have a rolling update strategy in place?
Is there a workaround for this? We use Spotinst, which drains nodes often.

@bjorn-ali-goransson

Unfortunately, with 1 replica, yes.

@shibumi

shibumi commented Nov 10, 2020

Hi, just another random Kubernetes user here. Sad to see that this is still an issue in 2020.
Can we revive this issue? I would really love to have this feature implemented in kubectl drain. I think the new kubectl rollout restart functionality could be a game changer for drain here.

@bjorn-ali-goransson

Agreed. The team should admit that a lot of people only have the resources for 1 replica and thus never reach the threshold of HA. Why can't we cater to them as well?

@nikskiz

nikskiz commented Feb 17, 2021

Unfortunately, the only workaround is to cordon the node and then do a rolling restart of the deployments that have pods running on the node. Once complete, drain the node.
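
Sketched out (the deployment and node names are placeholders, and this assumes the deployments use a surge-friendly strategy such as the maxSurge: 1 / maxUnavailable: 0 example posted earlier):

NODE=worker-1   # placeholder node name

# 1. Stop new pods from being scheduled onto the node.
kubectl cordon "$NODE"

# 2. Rolling-restart each Deployment that has pods on the node, so replacements
#    come up (and pass readiness checks) elsewhere before the originals are removed.
kubectl rollout restart deployment/my-nginx
kubectl rollout status deployment/my-nginx --timeout=5m

# 3. Only DaemonSet pods should remain on the node, so the drain completes quickly.
kubectl drain "$NODE" --ignore-daemonsets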

@vasily-22

A flag to kubectl drain could indicate that the user wants a rollout to be initiated for all affected deployments instead of evicting the pods. This would work for pods whose PDB allows zero disruptions and could avoid downtime with single-pod deployments.

@zchenyu

zchenyu commented Oct 19, 2023

Any update on this? This strongly affects GPU workloads, which often cannot run HA due to cost.
The cordon-plus-rollout workaround looks like it would work for manual drains, but things like node upgrades on managed services (GKE, EKS, etc.) would still run into this issue.

@DominicWatson

I can see the difficulty for both the Kubernetes and autoscaler projects in implementing this in a spec-consistent way.

In the meantime, we have created a little cronjob in k8s that does some hacky bash scripting to automate the otherwise-manual safe-drain script that @juliohm1978 posted above (https://gist.github.com/juliohm1978/fcfd21b26f9431c01978)

I've put together a rough gist of our workaround here: https://gist.github.com/DominicWatson/76e393e04e9c65439c3eff948d19e25a

This is running in our staging cluster where we have a big need for autoscaling down. As we evolve it and make it more sophisticated, I'll try to update the gist. Feedback and improvements welcome!
