
Fix fluentd-gcp addon liveness probe #74522

Merged
merged 1 commit into kubernetes:master on Feb 26, 2019

Conversation

@Pluies (Contributor) commented on Feb 25, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fix three issues with the fluentd-gcp liveness probe:

- STUCK_THRESHOLD_SECONDS was overridden by LIVENESS_THRESHOLD_SECONDS
if defined

Probably a copy/paste issue introduced in edf1ffc
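For illustration, a minimal sketch of the shape of this bug and the likely fix (the exact line in the probe script may differ slightly):

```sh
# Buggy: STUCK_THRESHOLD_SECONDS picks up the value of LIVENESS_THRESHOLD_SECONDS
# whenever that variable is set, instead of falling back to its own default
STUCK_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-900}

# Intended: default against the variable itself
STUCK_THRESHOLD_SECONDS=${STUCK_THRESHOLD_SECONDS:-900}
```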

- `[[` is [a bashism](https://stackoverflow.com/a/47576482), and will always fail when called with `/bin/sh`

Introduced by a844523

Given that we call the liveness probe with `/bin/sh`, we cannot use the
double-bracketed `[[` syntax for tests, as it is not POSIX-compliant and
will throw an error.

Annoyingly, even though it prints an error, `sh` returns with exit code 0
in this case:

```
root@fluentd-7mprs:/# sh liveness.sh
liveness.sh: 8: liveness.sh: [[: not found
liveness.sh: 15: liveness.sh: [[: not found
root@fluentd-7mprs:/# echo $?
0
```

This means the liveness probe is considered successful by Kubernetes,
despite failing to test what it was intended to test. This is also
probably the reason why this bug wasn't reported sooner :)

Thankfully, the test in this case can just as easily be written to be
POSIX-compliant, as it doesn't use any bash-specific features within the
`[[` block.
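For example, a check of this shape ports directly to the POSIX `[` built-in (a sketch only; the actual conditions and actions in liveness.sh may differ):

```sh
# bash-only: under /bin/sh this prints "[[: not found" and the check is skipped
if [[ -z "$(find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit)" ]]; then
  exit 1
fi

# POSIX-compliant: behaves the same under /bin/sh and bash
if [ -z "$(find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit)" ]; then
  exit 1
fi
```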

- Buffers are transient and cannot be relied upon for monitoring

Finally, after fixing the above issue, we started seeing the fluentd
containers being restarted very often, and found an issue with the
underlying logic of the liveness probe.

The probe checks that the pod is still alive by running the following
command:

`find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit`

This checks if any _regular_ file exists under `/var/log/fluentd-buffers`
that is more recent than a predetermined time, and will return an empty
string otherwise.

The issue is that these buffers are temporary and volatile: they get created and
deleted constantly. Here is an example of running that check every second on a
running fluentd:

```
root@fluentd-eks-playground-jdc8m:/# LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300};
root@fluentd-eks-playground-jdc8m:/# STUCK_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-900};
root@fluentd-eks-playground-jdc8m:/# touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck;
root@fluentd-eks-playground-jdc8m:/# touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness;
root@fluentd-eks-playground-jdc8m:/# while true; do date ; find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit ; sleep 1 ; done
Fri Feb 22 10:52:57 UTC 2019
Fri Feb 22 10:52:58 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964ccf4c7004103c3fa7c8533f85.log
Fri Feb 22 10:52:59 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964ccf4c7004103c3fa7c8533f85.log
Fri Feb 22 10:53:00 UTC 2019
Fri Feb 22 10:53:01 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964fb8b2eedcccd2763ea7775cc2.log
Fri Feb 22 10:53:02 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964fb8b2eedcccd2763ea7775cc2.log
Fri Feb 22 10:53:03 UTC 2019
Fri Feb 22 10:53:04 UTC 2019
Fri Feb 22 10:53:05 UTC 2019
Fri Feb 22 10:53:06 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:07 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:08 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:09 UTC 2019
Fri Feb 22 10:53:10 UTC 2019
Fri Feb 22 10:53:11 UTC 2019
Fri Feb 22 10:53:12 UTC 2019
Fri Feb 22 10:53:13 UTC 2019
Fri Feb 22 10:53:14 UTC 2019
Fri Feb 22 10:53:15 UTC 2019
Fri Feb 22 10:53:16 UTC 2019
```

We can see buffers being created, then disappearing. The LivenessProbe running
under these conditions has a ~50% chance of failing, despite fluentd being
perfectly happy.

I believe that check is probably OK for fluentd installs that use large
numbers of buffers, in which case the liveness probe will be correct more
often than not, but fluentd installs that use buffering less intensively
will be negatively impacted by this.

My solution to fix this is to check the last updated time of buffering
_folders_ within `/var/log/fluentd-buffers`. These _do_ get updated when
buffers are created, and do not get deleted as buffers are emptied,
making them the perfect candidate for our use.

Here's an example using `-type d` to match directories:

```
root@fluentd-eks-playground-jdc8m:/# while true; do date ; find /var/log/fluentd-buffers -type d -newer /tmp/marker-stuck -print -quit ; sleep 1 ; done
Fri Feb 22 10:57:51 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:52 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:53 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:54 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:55 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:56 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:57 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:58 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:59 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:00 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:01 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:02 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:03 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
```

And an example of the directory being updated as new buffers come in:

```
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 0
drwxr-xr-x 2 root root  6 Feb 22 11:17 .
drwxr-xr-x 3 root root 38 Feb 22 11:14 ..
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 16K
drwxr-xr-x 2 root root  224 Feb 22 11:18 .
drwxr-xr-x 3 root root   38 Feb 22 11:14 ..
-rw-r--r-- 1 root root 1.8K Feb 22 11:18 buffer.b58279be6e21e8b29fc333a7d50096ed0.log
-rw-r--r-- 1 root root  215 Feb 22 11:18 buffer.b58279be6e21e8b29fc333a7d50096ed0.log.meta
-rw-r--r-- 1 root root  429 Feb 22 11:18 buffer.b58279be6f09bdfe047a96486a525ece2.log
-rw-r--r-- 1 root root  195 Feb 22 11:18 buffer.b58279be6f09bdfe047a96486a525ece2.log.meta
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 0
drwxr-xr-x 2 root root  6 Feb 22 11:18 .
drwxr-xr-x 3 root root 38 Feb 22 11:14 ..
```
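Putting it together, the directory-based check could look roughly like this (a sketch of the approach; the exact script changed by this PR may differ):

```sh
STUCK_THRESHOLD_SECONDS=${STUCK_THRESHOLD_SECONDS:-900}
touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck

# Only fail the probe if no buffer directory has been updated within the threshold
if [ -z "$(find /var/log/fluentd-buffers -type d -newer /tmp/marker-stuck -print -quit)" ]; then
  exit 1
fi
```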

Which issue(s) this PR fixes:

Special notes for your reviewer:

/sig gcp

Does this PR introduce a user-facing change?:

- Fix liveness probe in fluentd-gcp cluster addon

@k8s-ci-robot added the release-note, kind/bug, size/XS, sig/gcp, cncf-cla: yes, and needs-priority labels on Feb 25, 2019
@k8s-ci-robot (Contributor)

Hi @Pluies. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test label on Feb 25, 2019
@k8s-ci-robot added the sig/cluster-lifecycle label on Feb 25, 2019
@x13n (Member) commented on Feb 25, 2019

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Feb 25, 2019
@x13n (Member) commented on Feb 26, 2019

@bmoyles0117 FYI

/lgtm
/approve

@k8s-ci-robot added the lgtm label on Feb 26, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Pluies, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Feb 26, 2019
@k8s-ci-robot merged commit b8ddc79 into kubernetes:master on Feb 26, 2019