This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Simple scheduler & controller-manager disaster recovery #112

Closed
aaronlevy opened this issue Aug 30, 2016 · 23 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/P1

Comments

@aaronlevy
Contributor

There are potential failure cases where you have permanently lost all schedulers and/or all controller-managers, and recovery leaves you in a chicken-and-egg state:

For example, assume you have lost all schedulers - but still have a functioning api-server which contains the scheduler deployment object:

You need a controller-manager to convert the deployment into unscheduled pods, and you need a scheduler to then schedule those pods to nodes (scheduler to schedule the scheduler, if you will).

While these types of situations should be mitigated by deploying across failure domains, it is still something we need to cover.

In the short term this could mean documenting, for example, how to create a temporary scheduler pod pre-assigned to a node (once a single scheduler exists, it will then schedule the rest of the scheduler pods).

Another option might be to build a tool which knows how to do this for you based on parsing an existing object:

kube-recover deployment kube-controller-manager --target=<node-ip>
kube-recover deployment kube-scheduler --target=<node-ip>

Where the kube-recover tool could read the object from the api-server (or from disk), parse out the podSpec, and pre-assign a pod to the target node (bypassing the need for both the controller-manager and the scheduler).
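
For reference, a rough sketch of what such a tool would do, approximated with kubectl and jq (kube-recover does not exist; the deployment name, pod name, and output file below are illustrative assumptions):

NODE=<node-ip>
# Read the Deployment from the api-server, pull out its pod template,
# wrap it in a bare Pod, and pre-assign it to the target node.
kubectl --namespace=kube-system get deployment kube-controller-manager -o json \
  | jq --arg node "$NODE" '{
      apiVersion: "v1",
      kind: "Pod",
      metadata: {name: "recovered-kube-controller-manager", namespace: "kube-system"},
      spec: (.spec.template.spec + {nodeName: $node})
    }' > recovery-pod.json
kubectl create -f recovery-pod.json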

@aaronlevy aaronlevy added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 30, 2016
@aaronlevy
Contributor Author

Thinking longer term, rather than a separate kube-recover tool, this could be a command in kubectl that knows how to extract a pod from a higher-order object -- this could be useful in the case where we have a kubelet-pod-api (so we could use it to natively push a pod to the kubelet api from a deployment/daemonset object rather than needing the intermediate pod state). See #97 (comment)

@chancez
Contributor

chancez commented Aug 30, 2016

👍 I think an interesting idea is to play with re-running bootkube for recovery. I found that generally, if I just re-run bootkube, nothing too crazy happens and the temporary control plane brings up a scheduler, allowing everything to correct itself. I really like the UX of this approach, even if it's not a real solution.

In the static manifest case this isn't really a problem, because the pods that run aren't scheduled, so the scheduler failing in any scenario always results in it being restarted and able to run (assuming it can grab its lease).

I wonder if there's a good way to do this. Perhaps checkpointing the scheduler to nodes that have role=master would be a good start, as an alternative to your proposed kube-recover strategy? This would be mostly automated; the only issue is how to run the checkpointer in a way that still works when the scheduler hits this failure. That's probably a static pod, and that means we can't easily manage it, which is problematic if the master nodes are changing.

@chancez
Contributor

chancez commented Aug 30, 2016

Also, definitely agree that a kubelet-pod-api that skips scheduling makes this story way better.

@Raffo

Raffo commented Sep 28, 2016

Hi guys, I was trying bootkube and hit a similar case. I noticed that the controller-manager had no --cloud-provider=aws flag set, so I tried editing it with kubectl edit... and it was a bad idea. The result is that I can't recover the cluster because I am not able to schedule anything.
What I'm thinking is: in this case we have etcd running on the master itself, which is just wrong, but if we ran a reliable, separate etcd cluster, this problem would not exist at all. Just run a new master with bootkube, attach it to the etcd cluster, and kill the broken one. Would this make sense?

@aaronlevy
Contributor Author

One low-hanging fruit is that we should be deploying multiple copies of the controller-manager/scheduler. In that case you would be doing a rolling-update of the component, verifying that the new functionality works before destroying all of the old copies.

However, there are still situations where we could lose all schedulers and/or controller-managers (e.g. maybe a flag change is subtly broken, but the pod is still running, so the deployment controller rolls out all broken pods).

You could launch a new master as an option, but if you still have an api-server/etcd running you should be able to recover. Essentially you would need to inject a controller-manager pod into the cluster, then delete it as soon as your existing controller-manager deployment has been scheduled.

For example:

kubectl --namespace=kube-system get deployment kube-controller-manager -oyaml

Then take the podSpec section (second indented spec, with a containers field right below):

Something like:

    spec:
      containers:
      - name: kube-controller-manager
        image: quay.io/coreos/hyperkube:v1.4.0_coreos.0
        command:
            [...]

Then wrap that in a pod header, and specify the nodeName it should run on:

apiVersion: v1
kind: Pod
metadata:
  name: recovery-cm
spec:
  nodeName: <a node in your cluster>
  containers:
  - name: kube-controller-manager
    image: quay.io/coreos/hyperkube:v1.4.0_coreos.0
    command:
       [...]

Then inject it into the cluster:
kubectl create -f recovery-pod.yaml

This pod will act as the controller-manager and convert your existing controller-manager deployment into pods, which will then be scheduled. After that you can just delete the recovery pod:

kubectl delete -f recovery-pod.yaml
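
To be safe, confirm that the real controller-manager pods have been created and assigned to nodes before removing the recovery pod (the label selector below is an assumption; use whatever labels your deployment sets):

kubectl --namespace=kube-system get pods -l k8s-app=kube-controller-manager -o wide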

@coresolve
Contributor

Just hit this as well, as a result of a Container Linux reboot.

@abourget

In addition, if your scheduler is down, recovering the controller-manager won't be enough. In that case, put a file like this in /etc/kubernetes/manifests/scheduler.yml on one of the nodes, so it moves your kube-scheduler Deployment's pods from Pending to Running:

#
# Add this to a node in `/etc/kubernetes/manifests` to recover your scheduler, and
# schedule the pods needed to run your configured Deployments :)
#
kind: Pod
apiVersion: v1
metadata:
  name: kube-scheduler
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler
spec:
  containers:
  - name: kube-scheduler
    image: quay.io/coreos/hyperkube:v1.5.3_coreos.0
    command:
    - ./hyperkube
    - scheduler
    - --leader-elect=true
    - --kubeconfig=/etc/kubernetes/kubeconfig
    volumeMounts:
    - name: etc-kubernetes
      mountPath: /etc/kubernetes
      readOnly: true
  volumes:
  - name: etc-kubernetes
    hostPath:
      path: /etc/kubernetes

@aaronlevy
Contributor Author

You can do the same steps I outlined above for the scheduler as well (and don't need to actually ssh into a machine to create the static manifest).

@abourget

But who schedules the recovery scheduler if the scheduler is dead? :-)

@aaronlevy
Contributor Author

Per the steps outlined above, you would populate the pod's spec.nodeName so that the pod is pre-assigned to a node - no scheduler needed.
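
For completeness, that recovery scheduler pod would look something like the sketch below (the node name and image tag are placeholders; copy the command from your own kube-scheduler Deployment):

apiVersion: v1
kind: Pod
metadata:
  name: recovery-scheduler
  namespace: kube-system
spec:
  nodeName: <a node in your cluster>
  containers:
  - name: kube-scheduler
    image: quay.io/coreos/hyperkube:<your cluster version>
    command:
       [...]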

@abourget

Oh right!! That's great. Thanks :-)

@mfburnett

Is this documented anywhere? Could help our users!

@radhikapc

radhikapc commented Mar 20, 2017

Working on it @mfburnett, as I hit the same issue today while upgrading. Should I create a doc defect for this, or would you do it for me? cc @aaronlevy

@mfburnett

Thanks @radhikapc!

@aaronlevy
Contributor Author

Just to track some internal discussions -- another option might be to propose a sub-command to kubectl upstream. Not sure of UX specifics, but maybe something like:

kubectl pod-from deployment/kube-scheduler --target=nodename
kubectl pod-from daemonset/foo --target=nodename
kubectl pod-from podtemplate/foo --target=nodename

@abourget

abourget commented Jun 7, 2017

Another simple way to hook a Pod back to a Node when the scheduler and controller-manager are dead:

kubectl create -f rescue-binding.yaml

with this content:

apiVersion: v1
kind: Binding
metadata:
  name: "kube-dns-2431531914-61pgv"
  namespace: "kube-system"
target:
  apiVersion: v1
  kind: Node
  name: "ip-10-22-4-152.us-west-2.compute.internal"

A Binding is the object injected into the cluster when the scheduler makes a scheduling decision. You can do the same manually.
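
To find the Pending pods that still need a manual Binding (the namespace here is an assumption):

kubectl --namespace=kube-system get pods --field-selector=status.phase=Pending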

@klausenbusk
Contributor

Another simple way to hook a Pod back to a Node when the scheduler and controller-manager are dead:

Thanks! This saved me from restarting the whole cluster (again) with bootkube recover and bootkube start.

Someone should probably add this to: https://github.com/kubernetes-incubator/bootkube/blob/master/Documentation/disaster-recovery.md

@redbaron
Contributor

redbaron commented Dec 5, 2017

Been thinking about this; can't the checkpointer help here? Mark the controller-manager as a pod to be checkpointed, and the checkpointer would then recover it if it can't find it on the node. There can be an edge case where it recovers too many pods, but given that they do leader election, it's fine to have a few extra running.

As for the scheduler, making it a DaemonSet would mean it's enough for just the controller-manager & apiserver to be alive, since DaemonSet pods are scheduled by the DaemonSet controller, at least in the current 1.8.x release.
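
A rough sketch of such a kube-scheduler DaemonSet (the apiVersion would be apps/v1beta2 on a 1.8.x cluster; the labels, master nodeSelector/toleration, image tag, and kubeconfig path are assumptions carried over from the earlier manifests):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-scheduler
  template:
    metadata:
      labels:
        k8s-app: kube-scheduler
    spec:
      # Run one scheduler per master node; tolerate the master taint.
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: kube-scheduler
        image: quay.io/coreos/hyperkube:<your cluster version>
        command:
        - ./hyperkube
        - scheduler
        - --leader-elect=true
        - --kubeconfig=/etc/kubernetes/kubeconfig
        volumeMounts:
        - name: etc-kubernetes
          mountPath: /etc/kubernetes
          readOnly: true
      volumes:
      - name: etc-kubernetes
        hostPath:
          path: /etc/kubernetes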

@aaronlevy
Contributor Author

aaronlevy commented Dec 12, 2017

With the behavior changes introduced in #755, checkpointing the controller-manager / scheduler might be possible (before that change we might garbage-collect the checkpoints of those components before replacements have been scheduled). It might still be a little bit racy though.

cc @diegs

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 22, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2019
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 22, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
