# Replication and other controllers: deploying managed pods

As you’ve learned so far, pods represent the basic deployable unit in Kubernetes.
You know how to create, supervise, and manage them manually. But in real-world
use cases, you want your deployments to stay up and running automatically and
remain healthy without any manual intervention. To do this, you almost never create
pods directly. Instead, you create other types of resources, such as
ReplicationControllers or Deployments, which then create and manage the actual pods.

When you create unmanaged pods (such as the ones you created in the previous
chapter), a cluster node is selected to run the pod and then its containers are
run on that node. In this chapter, you’ll learn that Kubernetes then monitors
those containers and automatically restarts them if they fail. But if the whole node
fails, the pods on the node are lost and will not be replaced with new ones, unless
those pods are managed by the previously mentioned ReplicationControllers or similar.
In this chapter, you’ll learn how Kubernetes checks if a container is still alive and
restarts it if it isn’t. You’ll also learn how to run managed pods—both those that run
indefinitely and those that perform a single task and then stop.

## Keeping pods healthy

One of the main benefits of using Kubernetes is the ability to give it a list of containers
and let it keep those containers running somewhere in the cluster. You do this by
creating a Pod resource and letting Kubernetes pick a worker node for it and run
the pod’s containers on that node. But what if one of those containers dies? What if
all containers of a pod die?

As soon as a pod is scheduled to a node, the Kubelet on that node will run its containers
and, from then on, keep them running as long as the pod exists. If the container’s
main process crashes, the Kubelet will restart the container. If your
application has a bug that causes it to crash every once in a while, Kubernetes will
restart it automatically, so even without doing anything special in the app itself, running
the app in Kubernetes automatically gives it the ability to heal itself.

But sometimes apps stop working without their process crashing. For example, a
Java app with a memory leak will start throwing OutOfMemoryErrors, but the JVM
process will keep running. It would be great to have a way for an app to signal to
Kubernetes that it’s no longer functioning properly and have Kubernetes restart it.
We’ve said that a container that crashes is restarted automatically, so maybe you’re
thinking you could catch these types of errors in the app and exit the process when
they occur. You can certainly do that, but it still doesn’t solve all your problems.
For example, what about those situations when your app stops responding because
it falls into an infinite loop or a deadlock? To make sure applications are restarted in
such cases, you must check an application’s health from the outside and not depend
on the app doing it internally.

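Kubernetes can check whether a container is still alive through liveness probes, which
you specify for each container in the pod’s specification. It can probe a container
using one of three mechanisms: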
- An `HTTP GET probe` performs an HTTP GET request on the container’s IP
address, a port and path you specify. If the probe receives a response, and the
response code doesn’t represent an error (in other words, if the HTTP response
code is 2xx or 3xx), the probe is considered successful. If the server returns an
error response code or if it doesn’t respond at all, the probe is considered a failure
and the container will be restarted as a result.
- A `TCP Socket probe` tries to open a TCP connection to the specified port of the
container. If the connection is established successfully, the probe is successful.
Otherwise, the container is restarted.
- An `Exec probe` executes an arbitrary command inside the container and checks
the command’s exit status code. If the status code is 0, the probe is successful.
All other codes are considered failures.
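
As a rough sketch, each mechanism corresponds to one field of a container’s livenessProbe definition. The fragments below are illustrative only: the port numbers and the /tmp/healthy file are assumptions, not something the example app is known to provide.

```yml
# HTTP GET probe: healthy when the response code is 2xx or 3xx
livenessProbe:
  httpGet:
    path: /
    port: 8080
---
# TCP Socket probe: healthy when a connection to the port can be opened
livenessProbe:
  tcpSocket:
    port: 8080
---
# Exec probe: healthy when the command exits with status code 0
# (/tmp/healthy is a hypothetical file the app would have to maintain)
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
```

Only one of the three fields is set on any given probe.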

### Creating an HTTP-based liveness probe

Let’s see how to add a liveness probe to your Node.js app. Because it’s a web app, it
makes sense to add a liveness probe that will check whether its web server is serving
requests. But because this particular Node.js app is too simple to ever fail, you’ll need
to make the app fail artificially.

To properly demo liveness probes, you’ll modify the app slightly and make it
return a 500 Internal Server Error HTTP status code for each request after the fifth
one—your app will handle the first five client requests properly and then return an
error on every subsequent request. Thanks to the liveness probe, it should be restarted
when that happens, allowing it to properly handle client requests again.
You can find the code of the new app in the book’s code archive (in the folder
Chapter04/kubia-unhealthy). I’ve pushed the container image to Docker Hub, so you
don’t need to build it yourself.

You’ll create a new pod that includes an HTTP GET liveness probe. The following
listing shows the YAML for the pod.

```yml
apiVersion: v1
kind: Pod
metadata:
  name: kubia-liveness
spec:
  containers:
  - image: luksa/kubia-unhealthy
    name: kubia
    livenessProbe:
      httpGet:
        path: /
        port: 8080
```

    kubectl create -f kubia-liveness-probe.yaml

The pod descriptor defines an httpGet liveness probe, which tells Kubernetes to periodically
perform HTTP GET requests on path / on port 8080 to determine if the container
is still healthy. These requests start as soon as the container is run.

After five such requests (or actual client requests), your app starts returning
HTTP status code 500, which Kubernetes will treat as a probe failure, and will thus
restart the container.

### Seeing a liveness probe in action

To see what the liveness probe does, try creating the pod now. After about a minute and
a half, the container will be restarted. You can see that by running kubectl get:

    kubectl get po kubia-liveness

The RESTARTS column shows that the pod’s container has been restarted once (if you
wait another minute and a half, it gets restarted again, and then the cycle continues
indefinitely).

You can see why the container had to be restarted by looking at what kubectl describe
prints out, as shown in the following listing.

    kubectl describe po kubia-liveness

You can see that the container is currently running, but it previously terminated
because of an error. The exit code was 137, which has a special meaning—it denotes
that the process was terminated by an external signal. The number 137 is a sum of two
numbers: 128+x, where x is the signal number sent to the process that caused it to terminate.
In the example, x equals 9, which is the number of the SIGKILL signal, meaning
the process was killed forcibly.

The events listed at the bottom show why the container was killed—Kubernetes
detected the container was unhealthy, so it killed and re-created it.

> NOTE When a container is killed, a completely new container is created—it’s
not the same container being restarted again.

### Configuring additional properties of the liveness probe

You may have noticed that kubectl describe also displays additional information
about the liveness probe:
    
    Liveness: http-get http://:8080/ delay=0s timeout=1s period=10s #success=1 #failure=3

Besides the liveness probe options you specified explicitly, you can also see additional
properties, such as delay, timeout, period, and so on. The delay=0s part shows that
the probing begins immediately after the container is started. The timeout is set to
only 1 second, so the container must return a response in 1 second or the probe is
counted as failed. The container is probed every 10 seconds (period=10s) and the
container is restarted after the probe fails three consecutive times (#failure=3).
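
The values in that line correspond to fields you can set on the probe. The following sketch spells out the same values explicitly, assuming the defaults reported by kubectl describe above; it should behave the same as leaving them out.

```yml
livenessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 0   # delay=0s: start probing right after the container starts
  timeoutSeconds: 1        # timeout=1s: a response must arrive within one second
  periodSeconds: 10        # period=10s: probe every 10 seconds
  successThreshold: 1      # #success=1
  failureThreshold: 3      # #failure=3: restart after three consecutive failures
```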

    
These additional parameters can be customized when defining the probe. For
example, to set the initial delay, add the initialDelaySeconds property to the liveness
probe as shown in the following listing.

Listing file: kubia-liveness-probe-initial-delay.yaml

```yml
livenessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 15
```

If you don’t set the initial delay, the prober will start probing the container as soon as
it starts, which usually leads to the probe failing, because the app isn’t ready to start
receiving requests. If the number of failures exceeds the failure threshold, the container
is restarted before it’s even able to start responding to requests properly.

> TIP Always remember to set an initial delay to account for your app’s startup
time.

I’ve seen this on many occasions and users were confused why their container was
being restarted. But if they’d used kubectl describe, they’d have seen that the container
terminated with exit code 137 or 143, telling them that the pod was terminated
externally. Additionally, the listing of the pod’s events would show that the container
was killed because of a failed liveness probe. If you see this happening at pod startup,
it’s because you failed to set initialDelaySeconds appropriately.

> NOTE Exit code 137 signals that the process was killed by an external signal
(the exit code is 128 + 9, where 9 is SIGKILL). Likewise, exit code 143 corresponds
to 128 + 15 (SIGTERM).

### Creating effective liveness probes

For pods running in production, you should always define a liveness probe. Without
one, Kubernetes has no way of knowing whether your app is still alive or not. As long
as the process is still running, Kubernetes will consider the container to be healthy.

**WHAT A LIVENESS PROBE SHOULD CHECK**

Your simplistic liveness probe simply checks if the server is responding. While this may
seem overly simple, even a liveness probe like this does wonders, because it causes the
container to be restarted if the web server running within the container stops
responding to HTTP requests. Compared to having no liveness probe, this is a major
improvement, and may be sufficient in most cases.

But for a better liveness check, you’d configure the probe to perform requests on a
specific URL path (/health, for example) and have the app perform an internal status
check of all the vital components running inside the app to ensure none of them
has died or is unresponsive.

> TIP Make sure the /health HTTP endpoint doesn’t require authentication;
otherwise the probe will always fail, causing your container to be restarted
indefinitely.
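
A probe pointed at such an endpoint might look like the following sketch. The /health path is hypothetical: the app has to implement it, and the kubia-unhealthy image used in this chapter isn’t known to do so.

```yml
livenessProbe:
  httpGet:
    path: /health   # hypothetical endpoint; must not require authentication
    port: 8080
  initialDelaySeconds: 15
```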

Be sure to check only the internals of the app and nothing influenced by an external
factor. For example, a frontend web server’s liveness probe shouldn’t return a failure
when the server can’t connect to the backend database. If the underlying cause is in
the database itself, restarting the web server container will not fix the problem.

Because the liveness probe will fail again, you’ll end up with the container restarting
repeatedly until the database becomes accessible again.

**KEEPING PROBES LIGHT**

Liveness probes shouldn’t use too many computational resources and shouldn’t take
too long to complete. By default, the probes are executed relatively often and are
only allowed one second to complete. Having a probe that does heavy lifting can slow
down your container considerably. Later in the book, you’ll also learn about how to
limit CPU time available to a container. The probe’s CPU time is counted in the container’s
CPU time quota, so having a heavyweight liveness probe will reduce the CPU
time available to the main application processes.

> TIP If you’re running a Java app in your container, be sure to use an HTTP
GET liveness probe instead of an Exec probe, where you spin up a whole new
JVM to get the liveness information. The same goes for any JVM-based or similar
applications, whose start-up procedure requires considerable computational
resources.

**DON’T BOTHER IMPLEMENTING RETRY LOOPS IN YOUR PROBES**

You’ve already seen that the failure threshold for the probe is configurable and usually
the probe must fail multiple times before the container is killed. But even if you
set the failure threshold to 1, Kubernetes will retry the probe several times before considering
it a single failed attempt. Therefore, implementing your own retry loop into
the probe is wasted effort.

**LIVENESS PROBE WRAP-UP**

You now understand that Kubernetes keeps your containers running by restarting
them if they crash or if their liveness probes fail. This job is performed by the Kubelet
on the node hosting the pod—the Kubernetes Control Plane components running on
the master(s) have no part in this process.

But if the node itself crashes, it’s the Control Plane that must create replacements for
all the pods that went down with the node. It doesn’t do that for pods that you create
directly. Those pods aren’t managed by anything except by the Kubelet, but because the
Kubelet runs on the node itself, it can’t do anything if the node fails.

To make sure your app is restarted on another node, you need to have the pod
managed by a ReplicationController or similar mechanism, which we’ll discuss in the
rest of this chapter.

## ReplicationControllers

A ReplicationController is a Kubernetes resource that ensures its pods are always
kept running. If the pod disappears for any reason, such as in the event of a node
disappearing from the cluster or because the pod was evicted from the node, the
ReplicationController notices the missing pod and creates a replacement pod.
Figure 4.1 shows what happens when a node goes down and takes two pods with it.

Pod A was created directly and is therefore an unmanaged pod, while pod B is managed
by a ReplicationController. After the node fails, the ReplicationController creates a
new pod (pod B2) to replace the missing pod B, whereas pod A is lost completely—
nothing will ever recreate it.

The ReplicationController in the figure manages only a single pod, but
ReplicationControllers, in general, are meant to create and manage multiple copies (replicas) of a
pod. That’s where ReplicationControllers got their name from.

### The operation of a ReplicationController

Like many things in Kubernetes, a ReplicationController, although an incredibly simple
concept, provides or enables the following powerful features:
- It makes sure a pod (or multiple pod replicas) is always running by starting a
new pod when an existing one goes missing.
- When a cluster node fails, it creates replacement replicas for all the pods that
were running on the failed node (those that were under the
ReplicationController’s control).
- It enables easy horizontal scaling of pods—both manual and automatic (see
horizontal pod auto-scaling in chapter 15).

> NOTE A pod instance is never relocated to another node. Instead, the
ReplicationController creates a completely new pod instance that has no relation
to the instance it’s replacing.

### Creating a ReplicationController

Let’s look at how to create a ReplicationController and then see how it keeps your
pods running. Like pods and other Kubernetes resources, you create a
ReplicationController by posting a JSON or YAML descriptor to the Kubernetes API server.

You’re going to create a YAML file called kubia-rc.yaml for your
ReplicationController, as shown in the following listing.

```yml
apiVersion: v1
kind: ReplicationController
metadata:
  name: kubia
spec:
  replicas: 3
  selector:
    app: kubia
  template:
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia
        ports:
        - containerPort: 8080
```

When you post the file to the API server, Kubernetes creates a new
ReplicationController named kubia, which makes sure three pod instances always match the
label selector app=kubia. When there aren’t enough pods, new pods will be created
from the provided pod template. The contents of the template are almost identical to
the pod definition you created in the previous chapter.

The pod labels in the template must obviously match the label selector of the
ReplicationController; otherwise the controller would create new pods indefinitely,
because spinning up a new pod wouldn’t bring the actual replica count any closer to
the desired number of replicas. To prevent such scenarios, the API server verifies the
ReplicationController definition and will not accept it if it’s misconfigured.
Not specifying the selector at all is also an option. In that case, it will be configured
automatically from the labels in the pod template.
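
A minimal sketch of the same ReplicationController with the selector left out, so that it is derived from the labels in the pod template. Everything else is copied from kubia-rc.yaml above.

```yml
apiVersion: v1
kind: ReplicationController
metadata:
  name: kubia
spec:
  replicas: 3
  # no selector here; it defaults to the template's labels (app: kubia)
  template:
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia
        ports:
        - containerPort: 8080
```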

> TIP Don’t specify a pod selector when defining a ReplicationController. Let
Kubernetes extract it from the pod template. This will keep your YAML
shorter and simpler.

To create the ReplicationController, use the kubectl create command, which you
already know:

    kubectl create -f kubia-rc.yaml

    
As soon as the ReplicationController is created, it goes to work. Let’s see what
it does.

### Seeing the ReplicationController in action

Because no pods exist with the app=kubia label, the ReplicationController should
spin up three new pods from the pod template. List the pods to see if the
ReplicationController has done what it’s supposed to:

    kubectl get pods

Indeed, it has! You wanted three pods, and it created three pods. It’s now managing
those three pods. Next you’ll mess with them a little to see how the
ReplicationController responds.

**SEEING THE REPLICATIONCONTROLLER RESPOND TO A DELETED POD**

First, you’ll delete one of the pods manually to see how the ReplicationController spins
up a new one immediately, bringing the number of matching pods back to three:

    kubectl delete pod kubia-53thy

Listing the pods again shows four of them, because the one you deleted is terminating,
and a new pod has already been created:
    
    kubectl get pods

The ReplicationController has done its job again. It’s a nice little helper, isn’t it?

**GETTING INFORMATION ABOUT A REPLICATIONCONTROLLER**

Now, let’s see what information the kubectl get command shows for ReplicationControllers:

    kubectl get rc

> NOTE We’re using rc as a shorthand for replicationcontroller.

You see three columns showing the desired number of pods, the actual number of
pods, and how many of them are ready (you’ll learn what that means in the next chapter,
when we talk about readiness probes).

You can see additional information about your ReplicationController with the
kubectl describe command, as shown in the following listing.

    kubectl describe rc kubia


The list of events at the bottom shows the actions taken by the
ReplicationController—it has created four pods so far.

**RESPONDING TO A NODE FAILURE**

Seeing the ReplicationController respond to the manual deletion of a pod isn’t too
interesting, so let’s look at a better example. If you’re using Google Kubernetes Engine
to run these examples, you have a three-node Kubernetes cluster. You’re going to disconnect
one of the nodes from the network to simulate a node failure.

> NOTE If you’re using Minikube, you can’t do this exercise, because you only
have one node that acts both as a master and a worker node.

If a node fails in the non-Kubernetes world, the ops team would need to migrate the
applications running on that node to other machines manually. Kubernetes, on the
other hand, does that automatically. Soon after the ReplicationController detects that
its pods are down, it will spin up new pods to replace them.

Let’s see this in action. You need to ssh into one of the nodes with the gcloud
compute ssh command and then shut down its network interface with sudo ifconfig
eth0 down, as shown in the following listing.

> NOTE Choose a node that runs at least one of your pods by listing pods with
the -o wide option.

Simulating a node failure by shutting down its network interface:

    gcloud compute ssh gke-kubia-default-pool-b46381f1-zwko

    sudo ifconfig eth0 down

When you shut down the network interface, the ssh session will stop responding, so
you need to open up another terminal or hard-exit from the ssh session. In the new
terminal you can list the nodes to see if Kubernetes has detected that the node is
down. This takes a minute or so. Then, the node’s status is shown as NotReady:

    kubectl get node

If you list the pods now, you’ll still see the same three pods as before, because Kubernetes
waits a while before rescheduling pods (in case the node is unreachable because
of a temporary network glitch or because the Kubelet is restarting). If the node stays
unreachable for several minutes, the status of the pods that were scheduled to that
node changes to Unknown. At that point, the ReplicationController will immediately
spin up a new pod. You can see this by listing the pods again:

    kubectl get pods
    

Looking at the age of the pods, you see that the kubia-dmdck pod is new. You again
have three pod instances running, which means the ReplicationController has again
done its job of bringing the actual state of the system to the desired state.
The same thing happens if a node fails (either breaks down or becomes unreachable).
No immediate human intervention is necessary. The system heals itself
automatically.

To bring the node back, you need to reset it with the following command:

    gcloud compute instances reset gke-kubia-default-pool-b46381f1-zwko
    
When the node boots up again, its status should return to Ready, and the pod whose
status was Unknown will be deleted.

### Horizontally scaling pods

You’ve seen how ReplicationControllers make sure a specific number of pod instances
is always running. Because it’s incredibly simple to change the desired number of replicas,
this also means scaling pods horizontally is trivial.

Scaling the number of pods up or down is as easy as changing the value of the replicas
field in the ReplicationController resource. After the change, the
ReplicationController will either see too many pods exist (when scaling down) and delete part of
them, or see too few of them (when scaling up) and create additional pods.

**SCALING UP A REPLICATIONCONTROLLER**

Your ReplicationController has been keeping three instances of your pod running.
You’re going to scale that number up to 10 now. As you may remember, you’ve
already scaled a ReplicationController in chapter 2. You could use the same command
as before:
    
    kubectl scale rc kubia --replicas=10

**SCALING A REPLICATIONCONTROLLER BY EDITING ITS DEFINITION**

Instead of using the kubectl scale command, you’re going to scale it in a declarative
way by editing the ReplicationController’s definition:

    kubectl edit rc kubia

When the text editor opens, find the spec.replicas field and change its value to 10,
as shown in the following listing.
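
The editor shows the full resource, including fields populated by the API server. The relevant part would look roughly like this sketch (other fields are omitted, and the exact surrounding content depends on your cluster):

```yml
apiVersion: v1
kind: ReplicationController
metadata:
  name: kubia
spec:
  replicas: 10   # changed from 3
  selector:
    app: kubia
```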

When you save the file and close the editor, the ReplicationController is updated and
it immediately scales the number of pods to 10:

    kubectl get rc

There you go. If the kubectl scale command makes it look as though you’re telling
Kubernetes exactly what to do, it’s now much clearer that you’re making a declarative
change to the desired state of the ReplicationController and not telling Kubernetes to
do something.

**SCALING DOWN WITH THE KUBECTL SCALE COMMAND**

Now scale back down to 3. You can use the kubectl scale command:
    
    kubectl scale rc kubia --replicas=3

All this command does is modify the spec.replicas field of the ReplicationController’s
definition—like when you changed it through kubectl edit.

**UNDERSTANDING THE DECLARATIVE APPROACH TO SCALING**

Horizontally scaling pods in Kubernetes is a matter of stating your desire: “I want to
have x number of instances running.” You’re not telling Kubernetes what or how to do
it. You’re just specifying the desired state.

This declarative approach makes interacting with a Kubernetes cluster easy. Imagine
if you had to manually determine the current number of running instances and
then explicitly tell Kubernetes how many additional instances to run. That’s more
work and is much more error-prone. Changing a simple number is much easier, and
in chapter 15, you’ll learn that even that can be done by Kubernetes itself if you
enable horizontal pod auto-scaling.

### Deleting a ReplicationController

When you delete a ReplicationController through kubectl delete, the pods are also
deleted. But because pods created by a ReplicationController aren’t an integral part
of the ReplicationController, and are only managed by it, you can delete only the
ReplicationController and leave the pods running, as shown in figure 4.7.

This may be useful when you initially have a set of pods managed by a
ReplicationController, and then decide to replace the ReplicationController with a ReplicaSet,
for example (you’ll learn about them next). You can do this without affecting the
pods and keep them running without interruption while you replace the
ReplicationController that manages them.

When deleting a ReplicationController with kubectl delete, you can keep its
pods running by passing the --cascade=false option to the command (newer kubectl
versions deprecate this value in favor of --cascade=orphan). Try that now:

    kubectl delete rc kubia --cascade=false


You’ve deleted the ReplicationController so the pods are on their own. They are no
longer managed. But you can always create a new ReplicationController with the
proper label selector and make them managed again.

## Using ReplicaSets instead of ReplicationControllers

Initially, ReplicationControllers were the only Kubernetes component for replicating
pods and rescheduling them when nodes failed. Later, a similar resource called a
ReplicaSet was introduced. It’s a new generation of ReplicationController and
replaces it completely (ReplicationControllers will eventually be deprecated).

You could have started this chapter by creating a ReplicaSet instead of a
ReplicationController, but I felt it would be a good idea to start with what was initially available in
Kubernetes. Plus, you’ll still see ReplicationControllers used in the wild, so it’s good
for you to know about them. That said, you should always create ReplicaSets instead
of ReplicationControllers from now on. They’re almost identical, so you shouldn’t
have any trouble using them instead.

You usually won’t create them directly, but instead have them created automatically
when you create the higher-level Deployment resource, which you’ll learn about
in chapter 9. In any case, you should understand ReplicaSets, so let’s see how they differ
from ReplicationControllers.

### Comparing a ReplicaSet to a ReplicationController

A ReplicaSet behaves exactly like a ReplicationController, but it has more expressive
pod selectors. Whereas a ReplicationController’s label selector only allows matching
pods that include a certain label, a ReplicaSet’s selector also allows matching pods
that lack a certain label or pods that include a certain label key, regardless of
its value.

Also, for example, a single ReplicationController can’t match pods with the label
env=production and those with the label env=devel at the same time. It can only match
either pods with the env=production label or pods with the env=devel label. But a single
ReplicaSet can match both sets of pods and treat them as a single group.

Similarly, a ReplicationController can’t match pods based merely on the presence
of a label key, regardless of its value, whereas a ReplicaSet can. For example, a
ReplicaSet can match all pods that include a label with the key env, whatever its actual value is
(you can think of it as env=*).
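
A ReplicaSet expresses these richer conditions through the matchExpressions field of its selector. The following is a hedged sketch of the two cases just described, written as alternative selector fragments for a ReplicaSet spec; the env label and its values are only the examples from the text.

```yml
# Matches pods labeled env=production or env=devel
selector:
  matchExpressions:
  - key: env
    operator: In
    values:
    - production
    - devel
---
# Matches any pod that has an env label, whatever its value
selector:
  matchExpressions:
  - key: env
    operator: Exists
```

Pods that lack a certain label can be matched in the same way with the DoesNotExist operator.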

### Defining a ReplicaSet

You’re going to create a ReplicaSet now to see how the orphaned pods that were created
by your ReplicationController and then abandoned earlier can now be adopted
by a ReplicaSet. First, you’ll rewrite your ReplicationController into a ReplicaSet by
creating a new file called kubia-replicaset.yaml with the contents in the following
listing.

    kubectl apply -f kubia-replicaset.yaml

```yml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: kubia
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kubia
  template:
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia
```

You’re creating a resource of type ReplicaSet, which has much the same contents as the ReplicationController you created earlier.

The only difference is in the selector. Instead of listing labels the pods need to
have directly under the selector property, you’re specifying them under
selector.matchLabels. This is the simpler (and less expressive) way of defining label selectors
in a ReplicaSet. Later, you’ll look at the more expressive option, as well.

Because you still have three pods matching the app=kubia selector running from earlier,
creating this ReplicaSet will not cause any new pods to be created. The ReplicaSet
will take those existing three pods under its wing.

### Creating and examining a ReplicaSet

Create the ReplicaSet from the YAML file, using either the kubectl apply command shown earlier or kubectl create. After
that, you can examine the ReplicaSet with kubectl get and kubectl describe:

    kubectl get rs

> TIP Use rs shorthand, which stands for replicaset.

    kubectl describe rs

As you can see, the ReplicaSet isn’t any different from a ReplicationController. It’s
showing it has three replicas matching the selector. If you list all the pods, you’ll see
they’re still the same three pods you had before. The ReplicaSet didn’t create any new
ones.

This was a quick introduction to ReplicaSets as an alternative to ReplicationControllers.
Remember, always use them instead of ReplicationControllers, but you may still find
ReplicationControllers in other people’s deployments.
Now, delete the ReplicaSet to clean up your cluster a little. You can delete the
ReplicaSet the same way you’d delete a ReplicationController:
    
    kubectl delete rs kubia

  
Deleting the ReplicaSet should delete all the pods. List the pods to confirm that’s
the case.

## Running exactly one pod on each node with DaemonSets

Both ReplicationControllers and ReplicaSets are used for running a specific number of pods deployed anywhere in the Kubernetes cluster. But certain cases exist when you want a pod to run on each and every node in the cluster (and each node needs to run exactly one instance of the pod, as shown in figure 4.8).

Those cases include infrastructure-related pods that perform system-level operations. For example, you’ll want to run a log collector and a resource monitor on every node. Another good example is Kubernetes’ own kube-proxy process, which needs to run on all nodes to make services work.

Outside of Kubernetes, such processes would usually be started through system init scripts or the systemd daemon during node boot up. On Kubernetes nodes, you can still use systemd to run your system processes, but then you can’t take advantage of all the features Kubernetes provides.

### Using a DaemonSet to run a pod on every node

To run a pod on all cluster nodes, you create a DaemonSet object, which is much like a ReplicationController or a ReplicaSet, except that pods created by a DaemonSet already have a target node specified and skip the Kubernetes Scheduler. They aren’t scattered around the cluster randomly.

A DaemonSet makes sure it creates as many pods as there are nodes and deploys each one on its own node, as shown in figure 4.8.

Whereas a ReplicaSet (or ReplicationController) makes sure that a desired number of pod replicas exist in the cluster, a DaemonSet doesn’t have any notion of a desired replica count. It doesn’t need it because its job is to ensure that a pod matching its pod selector is running on each node.

If a node goes down, the DaemonSet doesn’t cause the pod to be created elsewhere. But when a new node is added to the cluster, the DaemonSet immediately deploys a new pod instance to it. It also does the same if someone inadvertently deletes one of the pods, leaving the node without the DaemonSet’s pod. Like a ReplicaSet, a DaemonSet creates the pod from the pod template configured in it.

### Using a DaemonSet to run pods only on certain nodes

A DaemonSet deploys pods to all nodes in the cluster, unless you specify that the pods should only run on a subset of all the nodes. This is done by specifying the nodeSelector property in the pod template, which is part of the DaemonSet definition (similar to the pod template in a ReplicaSet or ReplicationController).

You’ve already used node selectors to deploy a pod onto specific nodes in chapter 3. A node selector in a DaemonSet is similar—it defines the nodes the DaemonSet must deploy its pods to.

> Later in the book, you’ll learn that nodes can be made unschedulable, preventing pods from being deployed to them. A DaemonSet will deploy pods even to such nodes, because the unschedulable attribute is only used by the Scheduler, whereas pods managed by a DaemonSet bypass the Scheduler completely. This is usually desirable, because DaemonSets are meant to run system services, which usually need to run even on unschedulable nodes.

**Explaining DaemonSets with an example**

Let’s imagine having a daemon called ssd-monitor that needs to run on all nodes that contain a solid-state drive (SSD). You’ll create a DaemonSet that runs this daemon on all nodes that are marked as having an SSD. The cluster administrators have added the disk=ssd label to all such nodes, so you’ll create the DaemonSet with a node selector that only selects nodes with that label, as shown in figure 4.9.

**Creating a DaemonSet YAML definition**

You’ll create a DaemonSet that runs a mock ssd-monitor process, which prints “SSD OK” to the standard output every five seconds. I’ve already prepared the mock container image and pushed it to Docker Hub, so you can use it instead of building your own. Create the YAML for the DaemonSet, as shown in the following listing.

```yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ssd-monitor
spec:
  selector:
    matchLabels:
      app: ssd-monitor
  template:
    metadata:
      labels:
        app: ssd-monitor
    spec:
      nodeSelector:
        disk: ssd
      containers:
      - name: main
        image: luksa/ssd-monitor
```

You’re defining a DaemonSet that will run a pod with a single container based on the luksa/ssd-monitor container image. An instance of this pod will be created for each node that has the disk=ssd label.

**Creating the DaemonSet**

You’ll create the DaemonSet like you always create resources from a YAML file:

    kubectl apply -f ssd-monitor-daemonset.yaml

Let’s see the created DaemonSet:

    kubectl get ds

The pod counts in the output are all zero, which looks strange. Didn’t the DaemonSet deploy any pods? List the pods:

    kubectl get po

Where are the pods? Do you know what’s going on? Yes, you forgot to label your nodes with the disk=ssd label. No problem—you can do that now. The DaemonSet should detect that the nodes’ labels have changed and deploy the pod to all nodes with a matching label. Let’s see if that’s true.

**Adding the required label to your node(s)**

Regardless of whether you’re using Minikube, GKE, or another multi-node cluster, you’ll need to list the nodes first, because you need to know a node’s name to label it:

    kubectl get node



Now, add the disk=ssd label to one of your nodes like this:

    kubectl label node minikube disk=ssd
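After the command succeeds, the node’s metadata should carry the new label. The following is an abridged sketch of what the Node object might then look like, assuming the node is named minikube as in the command above. A real node has many more labels and fields than shown here.

```yaml
# Abridged sketch of the Node object after labeling; most fields are omitted.
apiVersion: v1
kind: Node
metadata:
  name: minikube
  labels:
    disk: ssd                          # added by the kubectl label command
    kubernetes.io/hostname: minikube   # example of a label that was already present
```

This `disk: ssd` entry is what the DaemonSet’s nodeSelector is matched against.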

The DaemonSet should have created one pod now. Let’s see:

    kubectl get po

Okay; so far so good. If you have multiple nodes and you add the same label to further nodes, you’ll see the DaemonSet spin up pods for each of them.

**Removing the required label from the node**

Now, imagine you’ve made a mistake and have mislabeled one of the nodes. It has a spinning disk drive, not an SSD. What happens if you change the node’s label?

    kubectl label node minikube disk=hdd --overwrite

Let’s see if the change has any effect on the pod that was running on that node:

    kubectl get po

The pod is being terminated. But you knew that was going to happen, right? This wraps up your exploration of DaemonSets, so you may want to delete your ssd-monitor DaemonSet. If you still have any other daemon pods running, you’ll see that deleting the DaemonSet deletes those pods as well.

    kubectl delete ds ssd-monitor 

## Running pods that perform a single completable task

Up to now, we’ve only talked about pods that need to run continuously. You’ll have cases where you only want to run a task that terminates after completing its work. ReplicationControllers, ReplicaSets, and DaemonSets run continuous tasks that are never considered completed. Processes in such pods are restarted when they exit. But in a completable task, the process should not be restarted after it terminates.

### Introducing the Job resource

Kubernetes includes support for this through the Job resource, which is similar to the other resources we’ve discussed in this chapter, but it allows you to run a pod whose container isn’t restarted when the process running inside finishes successfully. Once it does, the pod is considered complete.

In the event of a node failure, the pods on that node that are managed by a Job will be rescheduled to other nodes the way ReplicaSet pods are. In the event of a failure of the process itself (when the process returns an error exit code), the Job can be configured to either restart the container or not.

Figure 4.10 shows how a pod created by a Job is rescheduled to a new node if the node it was initially scheduled to fails. The figure also shows both an unmanaged pod, which isn’t rescheduled, and a pod backed by a ReplicaSet, which is.

Jobs are useful for ad hoc tasks, where it’s crucial that the task finishes properly. You could run the task in an unmanaged pod and wait for it to finish, but if the node fails or the pod is evicted from the node while it is performing its task, you’d need to recreate the pod manually. Doing this manually doesn’t make sense, especially if the job takes hours to complete.

An example of such a job is transforming data stored in one place and exporting it to another. You’re going to emulate this by running a container image built on top of the busybox image, which invokes the sleep command for two minutes. I’ve already built the image and pushed it to Docker Hub, but you can peek into its Dockerfile in the book’s code archive.

### Defining a Job resource

Create the Job manifest as in the following listing.

```yml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  template:
    metadata:
      labels:
        app: batch-job
    spec:
      restartPolicy: OnFailure
      containers:
      - name: main
        image: luksa/batch-job
```

Jobs are part of the batch API group and v1 API version. The YAML defines a resource of type Job that will run the luksa/batch-job image, which invokes a process that runs for exactly 120 seconds and then exits.

In a pod’s specification, you can specify what Kubernetes should do when the processes running in the container finish. This is done through the restartPolicy pod spec property, which defaults to Always. Job pods can’t use the default policy, because they’re not meant to run indefinitely. Therefore, you need to explicitly set the restart policy to either OnFailure or Never. This setting is what prevents the container from being restarted when it finishes (not the fact that the pod is being managed by a Job resource).
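For comparison with the listing above, here is a sketch of the same Job using the other permitted policy, Never. The name `batch-job-never` and the `backoffLimit` value are assumptions chosen for illustration. With Never, a failed container isn’t restarted in place; the Job controller creates a replacement pod instead, up to the retry limit.

```yaml
# Sketch only: a variant of the batch-job listing; the name and backoffLimit value are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job-never
spec:
  backoffLimit: 3              # retries before the Job is marked failed (defaults to 6)
  template:
    metadata:
      labels:
        app: batch-job-never
    spec:
      restartPolicy: Never     # don't restart a failed container inside the same pod
      containers:
      - name: main
        image: luksa/batch-job
```

The rest of this section continues with the original batch-job manifest, which uses OnFailure.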

Create the Job by applying the manifest file:

    kubectl apply -f ex05-exporter.yaml

### Seeing a Job run a pod

After you create this Job with the kubectl apply command, you should see it start up a pod immediately:

    kubectl get jobs

    kubectl get po

After the two minutes have passed, the pod’s status changes to Completed and the Job is marked as completed. The pod no longer runs, but it still exists.

The reason the pod isn’t deleted when it completes is to allow you to examine its logs; for example:

    kubectl logs batch-job-28qf4

The pod will be deleted when you delete it or the Job that created it. Before you do that, let’s look at the Job resource again:

    kubectl get job

The Job is shown as having completed successfully. But why is that piece of information shown as a number instead of as yes or true?

### Running multiple pod instances in a Job

Jobs may be configured to create more than one pod instance and run them in parallel or sequentially. This is done by setting the completions and the parallelism properties in the Job spec.
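A minimal sketch of how these two properties might be set, reusing the batch-job pod template from earlier in the chapter. The Job name and the values 5 and 2 are arbitrary choices for illustration.

```yaml
# Sketch only: the name and the numeric values are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: multi-completion-batch-job
spec:
  completions: 5       # the Job is complete once five pods have finished successfully
  parallelism: 2       # at most two pods run at the same time
  template:
    metadata:
      labels:
        app: batch-job
    spec:
      restartPolicy: OnFailure
      containers:
      - name: main
        image: luksa/batch-job
```

Leaving parallelism out (it defaults to 1) would make the five pods run one after another.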