Simple scheduler & controller-manager disaster recovery #112

aaronlevy · 2016-08-30T22:43:06Z

There are potential failure cases, where you have permanently lost all schedulers and/or all controller-managers, where recovery leaves you in a chicken-egg state:

For example, assume you have lost all schedulers - but still have a functioning api-server which contains the scheduler deployment object:

You need a controller-manager to convert the deployment into unscheduled pods, and you need a scheduler to then schedule those pods to nodes (scheduler to schedule the scheduler, if you will).

While these types of situations should be mitigated by deploying across failure domains, it is still something we need to cover.

In the short term this could mean documenting, for example, how to create a temporary scheduler pod pre-assigned to a node (once a single scheduler exists, it will then schedule the rest of the scheduler pods).

Another option might be to build a tool which knows how to do this for you based on parsing an existing object:

kube-recover deployment kube-controller-manager --target=<node-ip>
kube-recover deployment kube-scheduler --target=<node-ip>

Where the kube-recover tool could read the object from the api-server (or from disk), parse out the podSpec and pre-assign a pod to the target node (bypassing both need for controller-manager and scheduler).

The text was updated successfully, but these errors were encountered:

aaronlevy · 2016-08-30T22:47:15Z

Thinking longer term, rather than a separate kube-recover tool, this could be a command in kubectl which knows how to extract a pod from higher-order object -- this could be useful in the case where we have a kubelet-pod-api (so we could use this to natively push pod to kubelet api from a deployment/daemonset object rather than needing the intermediate pod state). See #97 (comment)

chancez · 2016-08-30T22:48:06Z

👍 I think an interesting idea is to play with re-running bootkube for recovery. I found that generally if I just re-run bootkube nothing too crazy happens and the temporary control plane brings up a scheduler, allowing everything to correct itself. I really like the UX of this approach, even if its not a real solution.

In the static manifest case this problem isnt really a problem because the pods which run aren't scheduled, so the scheduler failing in any scenario always results in it being restarted and able to run (assuming it can grab it's lease).

I wonder if there's a good way to do this, perhaps by check pointing the scheduler to nodes which have role=master would be a good start, as an alternative to your proposed kube-recover strategy? This would be mostly automated, the only issue is how do you run this checkpointer in a way that works when the scheduler runs into this? Well, that's probably a static pod, and that means we can't easily manage it, which is problematic if the master nodes are changing.

chancez · 2016-08-30T22:49:02Z

Also, definitely agree that a kubelet-pod-api that skips scheduling makes this story way better.

Raffo · 2016-09-28T15:26:25Z

Hi guys, I was trying bootkube and I had a similar case. In my case, I figured that the controller had no --cloud-provide=aws flag set and I tried editing the controller-manger with kubectl edit... and it was a bad idea. The result was that I can't recover the cluster cause I am not able to schedule anything.
What I'm thinking is: in this case we have etcd running on the master itself, which is just wrong, but I thought that if we run a reliable, separated etcd cluster, this problem would not exist at all. Just run a new master with bootkube and attach it to the etcd cluster and kill the broken one. Would this make sense?

aaronlevy · 2016-09-29T00:12:15Z

One low-hanging fruit is that we should be deploying multiple copies of the controller-manager/scheduler. In that case you would be doing a rolling-update of the component, verifying that the new functionality works before destroying all of the old copies.

However, there are still situations where we have a loss of all schedulers and/or controller-manager (e.g. maybe a flag change is subtly broken, but the pod is still running so the deployment manager rolls out all broken pods).

You could launch a new master as an option, but if you still have an api-server/etcd running you should be able to recover. Essentially you would need to inject a controller-manager pod into the cluster, then delete it as soon as your existing controller-manager deployment has been scheduled.

For example:

kubectl --namespace=kube-system get deployment scheduler -oyaml

Then take the podSpec section (second indented spec, with a containers field right below):

Something like:

    spec:
      containers:
      - name: kube-controller-manager
        image: quay.io/coreos/hyperkube:v1.4.0_coreos.0
        command:
            [...]

Then wrap that in a pod header, and specify the nodeName it should run on:

apiVersion: v1
kind: Pod
metadata:
  name: recovery-cm
spec:
  nodeName: <a node in your cluster>
  containers:
  - name: kube-controller-manager
    image: quay.io/coreos/hyperkube:v1.4.0_coreos.0
    command:
       [...]

Then inject it in the cluster
kubectl create -f recovery-pod.yaml

What will happen is that this pod will act as controller-manager, convert your existing deployment/controller-manager into pods - then they will be scheduled. After that you can just delete the recovery pod:

kubectl delete -f recovery-pod.yaml

coresolve · 2017-03-15T22:57:55Z

Just hit this as well as a result of a container linux reboot

abourget · 2017-03-17T03:30:26Z

In addition, if your scheduler is down, recovering the controller-manager won't be enough. In this case, put a file like this in /etc/kubernetes/manifests/scheduler.yml on one of the nodes, so it passes your kube-scheduler Deployment into Running rather than Pending state:

#
# Add this to a node in `/etc/kubernetes/manifests` to recover your scheduler, and
# schedule the pods needed to run your configured Deployments :)
#
kind: Pod
apiVersion: v1
metadata:
  name: kube-scheduler
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler
spec:
  containers:
  - name: kube-scheduler
    image: quay.io/coreos/hyperkube:v1.5.3_coreos.0
    command:
    - ./hyperkube
    - scheduler
    - --leader-elect=true
    - --kubeconfig=/etc/kubernetes/kubeconfig
    volumeMounts:
    - name: etc-kubernetes
      mountPath: /etc/kubernetes
      readOnly: true
  volumes:
  - name: etc-kubernetes
    hostPath:
      path: /etc/kubernetes

aaronlevy · 2017-03-17T17:51:04Z

You can do the same steps I outlined above for the scheduler as well (and don't need to actually ssh into a machine to create the static manifest).

abourget · 2017-03-17T21:37:34Z

But who schedules the recovery scheduler if the scheduler is dead? :-)

aaronlevy · 2017-03-17T21:39:23Z

Per the steps outlined above you would populate the pod's spec.NodeName so that the pod is pre-assigned to a node - no scheduler needed.

abourget · 2017-03-17T21:51:07Z

Oh right!! That's great. Thanks :-)

mfburnett · 2017-03-20T20:16:38Z

Is this documented anywhere? Could help our users!

radhikapc · 2017-03-20T20:18:26Z

Working on it @mfburnett as i have hit with the same issue today while upgrading. Should I be creating a doc defect for this, or would you be doing it for me? cc @aaronlevy

mfburnett · 2017-03-20T20:53:21Z

Thanks @radhikapc!

aaronlevy · 2017-04-14T17:06:27Z

Just to track some internal discussions -- another option might be to propose a sub-command to kubectl upstream. Not sure of UX specifics, but maybe something like:

kubectl pod-from deployment/kube-scheduler --target=nodename
kubectl pod-from daemonset/foo --target=nodename
kubectl pod-from podtemplate/foo --target=nodename

abourget · 2017-06-07T17:41:52Z

Another simple way to hook back a Pod to a Node, when Scheduler + Controller-manager are dead:

kubectl create -f rescue-binding.yaml

with this content:

apiVersion: v1
kind: Binding
metadata:
  name: "kube-dns-2431531914-61pgv"
  namespace: "kube-system"
target:
  apiVersion: v1
  kind: Node
  name: "ip-10-22-4-152.us-west-2.compute.internal"

A Binding is the objected injected in the K8s cluster when the Schedule takes a scheduling decision. You can do the same manually.

klausenbusk · 2017-09-20T09:43:34Z

Another simple way to hook back a Pod to a Node, when Scheduler + Controller-manager are dead:

Thanks! This prevented me from restarting the whole cluster (again) with bootkube recover and bootkube start.

Someone should properly add this to: https://github.com/kubernetes-incubator/bootkube/blob/master/Documentation/disaster-recovery.md

redbaron · 2017-12-05T16:27:02Z

Been thinking about this, can't checkpointer help here? mark controller-manager as pod to be checkpointed, then it would be recovering it if it can't find it on the node. There can be edge case when it recovers too many pods, but given that they do leader election that is fine to have few extra running.

As for the scheduler, making it daemonset would allow it to be enough just for controller-manager & apiserver to be alive as daemonsets are scheduled by controller, at least in current 1.8.x release

aaronlevy · 2017-12-12T01:21:36Z

With the behavior changes introduced in #755 checkpointing the controller-manager / schedule might be possible (before that change we might garbage the checkpoints of those components before replacements have been scheduled). It might still be a little bit racy though.

cc @diegs

fejta-bot · 2019-04-22T06:50:30Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2019-05-22T07:33:02Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot · 2019-06-21T08:23:43Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2019-06-21T08:23:50Z

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

aaronlevy added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 30, 2016

aaronlevy added the priority/P1 label Feb 14, 2017

aaronlevy mentioned this issue Mar 9, 2017

Should there be a "kubernetes-operator"? #361

Closed

aaronlevy mentioned this issue Apr 14, 2017

Documentation: Disaster recovery scenarios #432

Closed

aaronlevy mentioned this issue May 17, 2017

Remove critical-pod annotations from scheduler & controller-manager #519

Closed

jamiehannaford mentioned this issue May 24, 2017

Finding a solution for etcd kubernetes/kubeadm#277

Closed

diegs mentioned this issue Jun 15, 2017

Add disaster recovery documentation. #584

Merged

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 22, 2019

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2019

k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 22, 2019

k8s-ci-robot closed this as completed Jun 21, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simple scheduler & controller-manager disaster recovery #112

Simple scheduler & controller-manager disaster recovery #112

aaronlevy commented Aug 30, 2016

aaronlevy commented Aug 30, 2016

chancez commented Aug 30, 2016 •

edited

Loading

chancez commented Aug 30, 2016

Raffo commented Sep 28, 2016

aaronlevy commented Sep 29, 2016

coresolve commented Mar 15, 2017

abourget commented Mar 17, 2017

aaronlevy commented Mar 17, 2017

abourget commented Mar 17, 2017

aaronlevy commented Mar 17, 2017

abourget commented Mar 17, 2017

mfburnett commented Mar 20, 2017

radhikapc commented Mar 20, 2017 •

edited

Loading

mfburnett commented Mar 20, 2017

aaronlevy commented Apr 14, 2017

abourget commented Jun 7, 2017

klausenbusk commented Sep 20, 2017

redbaron commented Dec 5, 2017

aaronlevy commented Dec 12, 2017 •

edited

Loading

fejta-bot commented Apr 22, 2019

fejta-bot commented May 22, 2019

fejta-bot commented Jun 21, 2019

k8s-ci-robot commented Jun 21, 2019

Simple scheduler & controller-manager disaster recovery #112

Simple scheduler & controller-manager disaster recovery #112

Comments

aaronlevy commented Aug 30, 2016

aaronlevy commented Aug 30, 2016

chancez commented Aug 30, 2016 • edited Loading

chancez commented Aug 30, 2016

Raffo commented Sep 28, 2016

aaronlevy commented Sep 29, 2016

coresolve commented Mar 15, 2017

abourget commented Mar 17, 2017

aaronlevy commented Mar 17, 2017

abourget commented Mar 17, 2017

aaronlevy commented Mar 17, 2017

abourget commented Mar 17, 2017

mfburnett commented Mar 20, 2017

radhikapc commented Mar 20, 2017 • edited Loading

mfburnett commented Mar 20, 2017

aaronlevy commented Apr 14, 2017

abourget commented Jun 7, 2017

klausenbusk commented Sep 20, 2017

redbaron commented Dec 5, 2017

aaronlevy commented Dec 12, 2017 • edited Loading

fejta-bot commented Apr 22, 2019

fejta-bot commented May 22, 2019

fejta-bot commented Jun 21, 2019

k8s-ci-robot commented Jun 21, 2019

chancez commented Aug 30, 2016 •

edited

Loading

radhikapc commented Mar 20, 2017 •

edited

Loading

aaronlevy commented Dec 12, 2017 •

edited

Loading