These are notes to accompany my KubeCon EU 2017 talk. The slides are available as well.

Kubernetes Day 2

These are notes to accompany my KubeCon EU 2017 talk. The slides are available as well. The video is available on YouTube.

How do you keep a Kubernetes cluster running long term? Just like any other service, you need a combination of monitoring, alerting, backup, upgrade, and infrastructure management strategies to make it happen. This talk will walk through and demonstrate the best practices for each of these questions and show off the latest tooling that makes it possible. The takeaway will be lessons and considerations that will influence the way you operate your own Kubernetes clusters.


These are notes for a conference talk. Much of this may become out of date very quickly. My goal is to turn much of this into docs over time.

Cluster Setup

All of the demos in this talk were done with a self-hosted cluster deployed with the Tectonic Installer on AWS.

This cluster was also deployed using the self-hosted etcd option, which at the time of writing hasn't been merged into the Tectonic Installer yet.

Failing a Scheduler

Scale it down to remove all schedulers

kubectl scale -n kube-system deployment kube-scheduler --replicas=0

OH NO, scale it back up

kubectl scale -n kube-system deployment kube-scheduler --replicas=1

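Unfortunately, it is too late. Everything is ruined?!?! The new scheduler pod is stuck in Pending because there is no running scheduler left to assign it to a node.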

kubectl get pods -l k8s-app=kube-scheduler -n kube-system
NAME                              READY     STATUS    RESTARTS   AGE
kube-scheduler-3027616201-53jfh   0/1       Pending   0          52s
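
One way to spot this state from a script is to filter the pod listing for Pending entries. This is only a minimal sketch: the function name pending_scheduler_pods is made up for illustration, and it assumes the default column layout shown above, where STATUS is the third column.

```shell
# Hypothetical helper: print the names of kube-scheduler pods that are Pending.
# Assumes the default `kubectl get pods` columns (NAME READY STATUS ...).
pending_scheduler_pods() {
  kubectl get pods -l k8s-app=kube-scheduler -n kube-system |
    awk 'NR > 1 && $3 == "Pending" { print $1 }'
}
```

If the function prints a pod name after the scale-up, the cluster is in the stuck state described here.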

Get the current scheduler deployment

kubectl get -n kube-system deployment -o yaml kube-scheduler > sched.yaml

Pick a node name from this list at random

kubectl get nodes -l master=true
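
If you would rather not pick by hand, the random choice can be scripted. This is a sketch, not part of the original demo: pick_master_node is a hypothetical name, and it assumes GNU coreutils shuf is installed and that -o name prints entries with a resource prefix such as node/NAME.

```shell
# Hypothetical helper: print the name of one randomly chosen master node.
# Assumes GNU coreutils `shuf`; strips the "node/" style prefix from -o name.
pick_master_node() {
  kubectl get nodes -l master=true -o name | shuf -n 1 | sed 's|.*/||'
}
```

For example, NODE="$(pick_master_node)" stores the chosen name for use in the next step.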

Edit sched.yaml so that it contains just the pod spec, and set the spec.nodeName field to the node selected above. Something like this:

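apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
spec:
  # Placeholder: the node name picked above
  nodeName: <node-name>
  containers:
  - command:
    - ./hyperkube
    - scheduler
    - --leader-elect=true
    # Placeholder: keep the image from the deployment's pod spec
    image: <scheduler-image>
    imagePullPolicy: IfNotPresent
    name: kube-scheduler
    resources: {}
    terminationMessagePath: /dev/termination-log
  dnsPolicy: ClusterFirst
  nodeSelector:
    master: "true"
  restartPolicy: Always
  securityContext: {}
  terminationGracePeriodSeconds: 30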
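
As an alternative to hand-editing, a manifest like this can be generated from a small template. The sketch below is an assumption-laden illustration rather than what was done in the talk: render_scheduler_pod is a hypothetical function, it keeps only the fields needed to start the scheduler, and both the node name and the image are supplied by the caller.

```shell
# Hypothetical helper: print a minimal scheduler pod manifest pinned to a node.
# $1 = node name, $2 = scheduler image (copy it from the deployment in sched.yaml).
render_scheduler_pod() {
  node="$1"
  image="$2"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
spec:
  nodeName: ${node}
  containers:
  - name: kube-scheduler
    image: ${image}
    command:
    - ./hyperkube
    - scheduler
    - --leader-elect=true
  restartPolicy: Always
EOF
}
```

The output can be piped straight into kubectl create -f -. Setting spec.nodeName is what lets the pod start without a scheduler: the kubelet on that node runs it directly.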

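Create the temporary pod from sched.yaml (for example with kubectl create -f sched.yaml). Once it is running, it schedules the Pending pod from the deployment, which should then become ready and can take over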

kubectl get pods -l k8s-app=kube-scheduler -n kube-system

Delete the temporary pod

kubectl delete pod -n kube-system kube-scheduler

Downgrade/Upgrade Scheduler

Edit the scheduler and downgrade a patch release.

kubectl edit -n kube-system deployment kube-scheduler

Now edit the scheduler and upgrade a patch release.

kubectl edit -n kube-system deployment kube-scheduler
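
The same image change can be made without an interactive editor. This is a sketch under assumptions: set_scheduler_image is a hypothetical wrapper, the container name kube-scheduler is taken from the pod spec shown earlier, and the image reference is whatever patch release you want to move to.

```shell
# Hypothetical helper: point the kube-scheduler deployment at a given image
# and wait for the rollout to finish. $1 = full image reference.
set_scheduler_image() {
  kubectl set image -n kube-system deployment/kube-scheduler \
    "kube-scheduler=$1" &&
    kubectl rollout status -n kube-system deployment/kube-scheduler
}
```

Calling it once with the older image and once with the newer one reproduces the downgrade and upgrade steps.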


kubectl drain and cordon

$ kubectl get nodes
NAME          STATUS    AGE
<node-name>   Ready     19h

To make a node unschedulable and remove all pods from it, run the following

kubectl drain <node-name>
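
On clusters that run DaemonSets, drain refuses to proceed unless it is told to skip DaemonSet-managed pods. The wrapper below is a small sketch of that: drain_node is a hypothetical name, and passing --ignore-daemonsets is an assumption about what you want, not something the talk prescribed.

```shell
# Hypothetical helper: drain a node, skipping DaemonSet-managed pods.
# Refuses to run without a node name so an empty argument never reaches kubectl.
drain_node() {
  if [ -z "$1" ]; then
    echo "usage: drain_node NODE" >&2
    return 1
  fi
  kubectl drain "$1" --ignore-daemonsets
}
```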

kubectl cordon and uncordon

To ensure a node doesn't get additional workloads you can cordon/uncordon a node. This is very useful to investigate an issue and to ensure a node doesn't change while debugging.

$ kubectl cordon <node-name>
node "<node-name>" cordoned

To undo, run uncordon

$ kubectl uncordon <node-name>
node "<node-name>" uncordoned
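
For debugging sessions, it can help to pair the two commands so the node is always returned to service. The following is a sketch only: with_cordon is a hypothetical helper that cordons a node, runs whatever command you give it, and uncordons afterwards regardless of how that command exits.

```shell
# Hypothetical helper: cordon a node, run a command, then uncordon again.
# $1 = node name, remaining arguments = the command to run while cordoned.
with_cordon() {
  node="$1"
  shift
  kubectl cordon "$node" || return 1
  "$@"
  rc=$?
  kubectl uncordon "$node"
  return "$rc"
}
```

The function returns the exit status of the wrapped command, so it can be used inside larger scripts.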


Using contrib/kube-prometheus deployed in the self-hosted configuration.

Proxy to run queries against Prometheus

while true; do kubectl port-forward -n monitoring prometheus-k8s-0 9090; done
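
With the port-forward running, queries can also be sent from the command line through the Prometheus HTTP API. This is a minimal sketch: prom_query is a hypothetical helper, and it assumes the forward above is listening on localhost:9090.

```shell
# Hypothetical helper: run an instant query against the forwarded Prometheus.
# Assumes the port-forward above is listening on localhost:9090.
prom_query() {
  curl -sG --data-urlencode "query=$1" http://localhost:9090/api/v1/query
}
```

For example, prom_query up returns a JSON document listing the scrape targets and whether each one is up.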

NOTE: a few bugs were found and filed against this configuration

Configure etcd backup

Note: S3 backup isn't working in the etcd Operator on self-hosted yet; hunting this down.

Set up AWS upload credentials:

kubectl create secret generic aws-credential --from-file=$HOME/.aws/credentials -n kube-system
kubectl create configmap aws-config --from-file=$HOME/.aws/config-us-west-1 -n kube-system
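
Edit the etcd Operator deployment and add the backup flags to its command:

kubectl edit deployment etcd-operator -n kube-system

      - command:
        - /usr/local/bin/etcd-operator
        - --backup-aws-secret
        - aws-credential
        - --backup-aws-config
        - aws-config
        - --backup-s3-bucket
        - tectonic-eo-etcd-backups

Then export the etcd cluster object, edit the file to configure backups, and replace it:

kubectl get cluster.etcd -n kube-system kube-etcd -o yaml > etcd
kubectl replace -f etcd -n kube-system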