## Monitoring and Self-Healing with Probes

- On each node in a Kubernetes cluster, the **kubelet** monitors the **health of Pods** and restarts them when necessary.

<img src=../notebook_images/pod_crash_restart.png width="600" height="200" style="margin: 1em" />

- But the **kubelet** can't determine the **health of a container running inside a Pod** without additional help.

<img src=../notebook_images/container_crash_no_restart.png width="600" height="200" style="margin: 1em" />

- In the **containers** section of a Pod (or Pod template) definition, each container can define three different types of **Probes**.
  - <span style="color:#DDDD00;font-weight:bold">startupProbe</span> : used to determine if the container's microservice is considered initialized and **started**.
    - A **failed startup probe** will **restart the container**.
  - <span style="color:#00C800;font-weight:bold">readinessProbe</span> : used to determine if the container's microservice is considered **ready to accept traffic**.
    - A **failed readiness probe** will **stop the container from receiving traffic** (it will not restart the container).
  - <span style="color:#6688FF;font-weight:bold">livenessProbe</span> : used to determine if the container's microservice is considered **alive**.
    - A **failed liveness probe** will **restart the container**.
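
The rules in the list above can be condensed into a small Python model. This is only an illustrative sketch: the names `ProbeType`, `reaction_to_failure` and `active_probes` are made up for this example and are not part of any Kubernetes API.

```python
from enum import Enum


class ProbeType(Enum):
    STARTUP = "startupProbe"
    READINESS = "readinessProbe"
    LIVENESS = "livenessProbe"


def reaction_to_failure(probe: ProbeType) -> str:
    """Return what happens to the container once the given probe has failed."""
    if probe in (ProbeType.STARTUP, ProbeType.LIVENESS):
        return "restart container"
    return "stop sending traffic"


def active_probes(configured: set, startup_succeeded: bool) -> set:
    """Return the probes that run in the current phase.

    While a configured startupProbe has not succeeded yet, it is the only
    probe that runs; afterwards the other configured probes take over.
    """
    if ProbeType.STARTUP in configured and not startup_succeeded:
        return {ProbeType.STARTUP}
    return configured - {ProbeType.STARTUP}


for probe in ProbeType:
    print(probe.value, "->", reaction_to_failure(probe))

running = active_probes(set(ProbeType), startup_succeeded=True)
print(sorted(p.value for p in running))  # ['livenessProbe', 'readinessProbe']
```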

<img src=../notebook_images/pod_with_probes.png width="900" height="500" style="margin: 1em" />

- The **kubelet** will **use the probes to determine a container's health** (or more importantly, the **health of the microservice** it contains).
  - The **kubelet** checks a container periodically using the configured **probes**.
  - The **kubelet** will **restart a container** whose **liveness probe** is considered **failed**.

<img src=../notebook_images/kubelet_container_probing.png width="300" height="150" style="margin: 1em" />

- Each probe can be configured to use one of three **probe actions**.
  - <span style="color:#DDDD00;font-weight:bold">exec</span> action will execute a command inside the container.
    - The action **succeeds** if the command's **exit code is 0**.
    - The action **fails** if the command's **exit code is not 0**.
  - <span style="color:#00C800;font-weight:bold">tcpSocket</span> action will check whether a TCP port is open in the container.
    - The action **succeeds** if the TCP **port is open**.
    - The action **fails** if the TCP **port is not open**.
  - <span style="color:#6688FF;font-weight:bold">httpGet</span> action will issue an HTTP GET request against a specific port and path in the container.
    - The action **succeeds** if the **HTTP response code is at least 200 and less than 400** (e.g. **200**).
    - The action **fails** if the **HTTP response code is anything else** (e.g. **500**).
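
To make the success criteria concrete, here is a rough Python imitation of the three actions. It is a sketch under assumptions, not how the kubelet is implemented: the function names are invented for this example, and the `exec` variant runs the command on the local machine rather than inside a container.

```python
import socket
import subprocess
import sys


def exec_succeeds(command: list) -> bool:
    """exec action: success means the command exits with code 0."""
    return subprocess.run(command, capture_output=True).returncode == 0


def tcp_socket_succeeds(host: str, port: int, timeout: float = 1.0) -> bool:
    """tcpSocket action: success means a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_get_succeeds(status_code: int) -> bool:
    """httpGet action: success means 200 <= status code < 400."""
    return 200 <= status_code < 400


# A command that exits with 0 counts as success, any other exit code as failure.
print(exec_succeeds([sys.executable, "-c", "raise SystemExit(0)"]))  # True
print(exec_succeeds([sys.executable, "-c", "raise SystemExit(1)"]))  # False

# Redirects (3xx) still count as success, client and server errors do not.
print(http_get_succeeds(302), http_get_succeeds(404), http_get_succeeds(500))  # True False False
```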

<img src=../notebook_images/pod_with_probe_actions.png width="800" height="500" style="margin: 1em" />

- Also notice the following settings each probe can define:
  - **initialDelaySeconds** is the initial delay in seconds between the container starting and the probing commencing.
  - **periodSeconds** is the time in seconds between each successive probe.
  - **failureThreshold** is the number of consecutive times the probe can be unsuccessful before the probe is considered failed.
    - E.g. `initialDelaySeconds: 5`, `periodSeconds: 2`, `failureThreshold: 3` means:
      - The probing by the `kubelet` commences `5` seconds after the container starts.
      - The `kubelet` will run the probe every `2` seconds after the initial delay.
      - If the probe is unsuccessful `3` times in a row, the probe is considered failed.
      - The first probe runs once the initial delay has elapsed, i.e. about `5` seconds after the container starts. Depending on where the kubelet's probe timer is in its cycle, it can be up to one `periodSeconds` later (at most about `5 + 2 = 7` seconds).
  - Note that:
    - The first probe runs between `initialDelaySeconds` and roughly `initialDelaySeconds + periodSeconds` after the container starts.
    - The **readinessProbe** and the **livenessProbe** won't run until the **startupProbe** succeeds.
    - A probe is considered **failed** after **failureThreshold** consecutive **unsuccessful attempts**.
    - A failed **startupProbe** or **livenessProbe** will **restart** the container.
    - A failed **readinessProbe** will **not restart** the container, but will **stop traffic** being sent to the Pod.
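
The interplay of these settings can be explored with a small simulation. This is a simplified sketch, assuming the first probe fires exactly `initialDelaySeconds` after the container starts (on a real cluster it may be somewhat later) and that a single successful attempt resets the failure count. The function name `time_probe_fails` and the `is_healthy` callback are invented for this example.

```python
def time_probe_fails(initial_delay, period, failure_threshold, is_healthy, horizon=3600):
    """Return the second (after container start) at which the probe is
    considered failed, or None if that never happens within `horizon`.

    `is_healthy(t)` tells whether a probe attempt at second `t` succeeds.
    """
    consecutive_failures = 0
    t = initial_delay
    while t <= horizon:
        if is_healthy(t):
            consecutive_failures = 0  # a success resets the counter
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return t
        t += period
    return None


# initialDelaySeconds: 5, periodSeconds: 2, failureThreshold: 3
# Never healthy: attempts at 5, 7 and 9 seconds all fail.
print(time_probe_fails(5, 2, 3, lambda t: False))  # 9

# Healthy for the first 8 seconds: attempts at 9, 11 and 13 seconds fail.
print(time_probe_fails(5, 2, 3, lambda t: t < 8))  # 13
```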

## Ensure a Kubernetes cluster is running

- Use any Kubernetes cluster.
  - In this example, a Minikube cluster with 3 nodes is used: `minikube start --nodes 3`.

## Start watching Events produced by the `liveness-example` Pod

- Run this command in a terminal (it won't work from a notebook cell).
- The command will continuously:
  - Listen for Events produced by a resource named `liveness-example`.
  - Output a new line to the terminal with Event information every time an Event occurs.

  ```bash
  # Monitor (watch) events produced by the "liveness-example" Pod
  kubectl get events --watch --field-selector involvedObject.name=liveness-example
  ```

## Deploy Pod called `liveness-example`

- The Pod definition is in the file `manifests/liveness-example.yaml`.
- It defines a Pod named `liveness-example`.
- It uses the `busybox` image, and runs the command below when the container starts.
  - The command will:
    - Create the file `/tmp/healthy` (in the container).
    - Sleep for `15` seconds.
    - Delete the file `/tmp/healthy`.
    - Sleep for `3600` seconds.

  ```yaml
  args:
      - /bin/sh
      - -c
      - touch /tmp/healthy; sleep 15; rm -rf /tmp/healthy; sleep 3600
  ```

- It defines a **livenessProbe** as below:
  - The first probe runs once `initialDelaySeconds = 5` seconds have passed since the container started (in practice up to one `periodSeconds` later, i.e. between `5` and `5 + 5 = 10` seconds).
  - After the first probe, the probe runs every `periodSeconds = 5` seconds.
  - The probe uses the `exec` action to execute a command in the container.
  - Each time the probe is run, the command (`cat`) tries to print the contents of the file `/tmp/healthy` (in the container).
    - If the command's exit code is 0 (i.e. it managed to read the file), the probe attempt is successful.
    - If the command's exit code is not 0 (i.e. it failed to read the file), the probe attempt is unsuccessful.
  - If `failureThreshold = 2` consecutive probe attempts are unsuccessful, the probe is considered failed.
    - This will cause the `kubelet` to restart the container.

  ```yaml
  livenessProbe:
        exec:
          command:
          - cat
          - /tmp/healthy
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 2
  ```
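
The manifest file itself is not reproduced here. As a rough sketch, the two fragments above could fit into a Pod definition like the one below, written as a Python dictionary and printed as JSON (a format `kubectl apply -f` also accepts). The overall layout is an assumption about the file. The label `test: liveness` and the container name `liveness` are taken from the command output shown later in this notebook, and the resource requests and limits are left out.

```python
import json

# Assumed reconstruction of the Pod definition; only the args and the
# livenessProbe are quoted directly from the fragments above.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "liveness-example", "labels": {"test": "liveness"}},
    "spec": {
        "containers": [
            {
                "name": "liveness",
                "image": "busybox",
                "args": [
                    "/bin/sh",
                    "-c",
                    "touch /tmp/healthy; sleep 15; rm -rf /tmp/healthy; sleep 3600",
                ],
                "livenessProbe": {
                    "exec": {"command": ["cat", "/tmp/healthy"]},
                    "initialDelaySeconds": 5,
                    "periodSeconds": 5,
                    "failureThreshold": 2,
                },
            }
        ]
    },
}

print(json.dumps(pod, indent=2))
```

Saving the printed JSON to a file (for example a hypothetical `liveness-example.json`) and applying it with `kubectl apply -f` should create a comparable Pod.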

In [1]:
!kubectl apply -f manifests/liveness-example.yaml

pod/liveness-example created


## Check the Events produced by `liveness-example` Pod in the terminal

- You should see something similar to the output below.
  - First the `busybox` image is pulled from Docker Hub.
  - Then the `busybox` container `liveness` is created and started.
    - At first the probe is successful, since the probe command is successful.
    - This is because the container has created the file `/tmp/healthy`.
  - Eventually the `livenessProbe` fails **twice** (`failureThreshold: 2`).
    - The error message from the probe command run in the container is: `cat: can't open '/tmp/healthy': No such file`.
    - This is because the container has deleted the file `/tmp/healthy` after `15` seconds.
  - The `kubelet` restarts the container.
  - This pattern then keeps repeating.

```bash
0s  Normal   Scheduled  pod/liveness-example  Successfully assigned default/liveness-example to minikube-m02

0s  Normal   Pulling    pod/liveness-example  Pulling image "busybox"
0s  Normal   Pulled     pod/liveness-example  Successfully pulled image "busybox" in 1.349s
0s  Normal   Created    pod/liveness-example  Created container liveness
0s  Normal   Started    pod/liveness-example  Started container liveness
0s  Warning  Unhealthy  pod/liveness-example  Liveness probe failed: cat: can't open '/tmp/healthy': No such file
0s  Warning  Unhealthy  pod/liveness-example  Liveness probe failed: cat: can't open '/tmp/healthy': No such file
0s  Normal   Killing    pod/liveness-example  Container liveness failed liveness probe, will be restarted

0s  Normal   Pulling    pod/liveness-example  Pulling image "busybox"
0s  Normal   Pulled     pod/liveness-example  Successfully pulled image "busybox" in 1.16s
0s  Normal   Created    pod/liveness-example  Created container liveness
0s  Normal   Started    pod/liveness-example  Started container liveness
0s  Warning  Unhealthy  pod/liveness-example  Liveness probe failed: cat: can't open '/tmp/healthy': No such file
0s  Warning  Unhealthy  pod/liveness-example  Liveness probe failed: cat: can't open '/tmp/healthy': No such file
0s  Normal   Killing    pod/liveness-example  Container liveness failed liveness probe, will be restarted
```

## Describe the Pod (detailed Pod information)

- The output of `kubectl describe pod` ends with an `Events` section that contains similar information to the Events shown above.
- You should see something similar to the output below.


  ```bash
  Events:
    Type     Reason     Age                From               Message
    ----     ------     ----               ----               -------
    Normal   Scheduled  37s                default-scheduler  Successfully assigned default/liveness-example to minikube-m02
    Normal   Pulling    37s                kubelet            Pulling image "busybox"
    Normal   Pulled     36s                kubelet            Successfully pulled image "busybox" in 1.2s
    Normal   Created    36s                kubelet            Created container liveness
    Normal   Started    36s                kubelet            Started container liveness
    Warning  Unhealthy  12s (x2 over 17s)  kubelet            Liveness probe failed: cat: can't open '/tmp/healthy'
    Normal   Killing    12s                kubelet            Container liveness failed liveness probe, will be restarted
  ```

In [2]:
!kubectl describe pod liveness-example

Name:             liveness-example
Namespace:        default
Priority:         0
Service Account:  default
Node:             minikube-m02/192.168.49.3
Start Time:       Wed, 14 Feb 2024 13:21:03 +0100
Labels:           test=liveness
Annotations:      <none>
Status:           Running
IP:               10.244.1.4
IPs:
  IP:  10.244.1.4
Containers:
  liveness:
    Container ID:  docker://a86b9fd70cbf0060aa785b4d3d95b22c0afd0413c60338ec8fc674686e94528d
    Image:         busybox
    Image ID:      docker-pullable://busybox@sha256:6d9ac9237a84afe1516540f40a0fafdc86859b2141954b4d643af7066d598b74
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      touch /tmp/healthy; sleep 15; rm -rf /tmp/healthy; sleep 3600
    State:          Running
      Started:      Wed, 14 Feb 2024 13:21:05 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     250m
      memory:  256Mi
    Requests:
      cpu:        100m
      memory:     128Mi
    Liven

## Check the Events for Pod `liveness-example` in K9s

- Open K9s by typing `k9s` in a new terminal.
- Type `:events /liveness-example` to open the Events view for the `liveness-example` Pod.
  - Press `shift + l` to sort the events in decreasing order (`LastSeen`).
- Type `:popeye` to open "Popeye" with a status overview of cluster resources.
  - Notice the `liveness-example` Pod is failing (`ERROR`).
- Type `:pods` to open the Pods view.
  - Notice the `STATUS` for the `liveness-example` Pod is `CrashLoopBackOff`.
- Press `ctrl + c` to close K9s.

## Delete the Pod

In [3]:
!kubectl delete -f manifests/liveness-example.yaml --force --grace-period=0

pod "liveness-example" force deleted


## Stop watching for events produced by the `liveness-example` Pod

- Press `ctrl + c` in the terminal watching for events.

## Deploy the Pod called `liveness-example` again

In [4]:
!kubectl apply -f manifests/liveness-example.yaml

pod/liveness-example created


## Watch Pod information

- You can also use the `--watch` option with the `kubectl get` command.
- Run the command below in a new terminal (it won't work in a notebook cell).
  - Notice the command will output a new line every time information about the Pod changes.
    - The `kubelet` will restart the container after the `livenessProbe` fails.
    - Eventually the Pod's `STATUS` will change to `CrashLoopBackOff` after too many failed restarts.
    - This pattern repeats.

  ```bash
  kubectl get pod liveness-example -o wide --watch
  ```

  - You should see something similar to the output below.

    ```bash
    NAME               READY   STATUS              RESTARTS        AGE     IP           NODE
    liveness-example   0/1     ContainerCreating   0                  2s   <none>       minikube-m02
    liveness-example   1/1     Running             0                  2s   10.244.1.3   minikube-m02
    liveness-example   1/1     Running             1 (2s ago)        57s   10.244.1.3   minikube-m02
    liveness-example   1/1     Running             2 (2s ago)       112s   10.244.1.3   minikube-m02
    liveness-example   1/1     Running             3 (2s ago)      2m47s   10.244.1.3   minikube-m02
    liveness-example   1/1     Running             4 (2s ago)      3m42s   10.244.1.3   minikube-m02
    liveness-example   1/1     Running             5 (2s ago)      4m37s   10.244.1.3   minikube-m02
    liveness-example   0/1     CrashLoopBackOff    5 (0s ago)      5m30s   10.244.1.3   minikube-m02
    liveness-example   1/1     Running             6 (83s ago)     6m53s   10.244.1.3   minikube-m02
    liveness-example   0/1     CrashLoopBackOff    6 (0s ago)      7m45s   10.244.1.3   minikube-m02
    liveness-example   1/1     Running             7 (2m47s ago)   10m     10.244.1.3   minikube-m02
    liveness-example   0/1     CrashLoopBackOff    7 (1s ago)      11m     10.244.1.3   minikube-m02
    ```
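
The `CrashLoopBackOff` status is related to the delay the kubelet puts between restarts. The Kubernetes documentation describes this delay as an exponential back-off (10s, 20s, 40s, ...) capped at 300 seconds. The sketch below only computes that documented sequence; the function name `restart_backoff_delays` is invented for this example, and the kubelet's actual bookkeeping is more involved, so the numbers will not necessarily line up with the ages in the output above.

```python
def restart_backoff_delays(restarts, base=10, cap=300):
    """Return the documented back-off delay (in seconds) for each of the
    first `restarts` restarts: it doubles every time, up to `cap`."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


print(restart_backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```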

## Delete the Pod

In [5]:
!kubectl delete -f manifests/liveness-example.yaml --force --grace-period=0

pod "liveness-example" force deleted


## Stop watching information for the `liveness-example` Pod

- Press `ctrl + c` in the terminal watching for Pod information.