
Use etcdctl endpoint health as a etcd's livenessProbe #97034

Merged
merged 1 commit on Dec 11, 2020

Conversation

Member

@mborsz mborsz commented Dec 3, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:
Without this PR, etcd's current livenessProbe uses the /health endpoint, which fails if any of the following conditions is met (src):

  • there is an active alarm (there are two kinds of alarm: NOSPACE and CORRUPT)
  • there is no raft leader
  • the latency of a QGET request exceeds 1s

The problem is that in most of these cases, restarting etcd isn't the right behavior:

  • if there is a NOSPACE alarm, the restart will not free that space
  • if there is no raft leader, restarting this member won't bring a leader back (the cause usually lies with another member or the network)
  • if the request latency is > 1s, the etcd cluster is overloaded, which is bad, but a restart will generate even more load
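
For context, the pre-change probe is an httpGet check against that /health endpoint; a rough sketch of its shape, reconstructed from the manifest snippets quoted later in this review (the template variable names are illustrative, not verbatim):

    "livenessProbe": {
      "httpGet": {
        "host": "127.0.0.1",
        "port": {{ etcd_livenessprobe_port }},
        "path": "/health"
      },
      "initialDelaySeconds": {{ liveness_probe_initial_delay }},
      "timeoutSeconds": 15
    },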

The new livenessProbe, etcdctl endpoint health, checks the following condition (src):

  • whether a linearized (i.e. quorum-based) Get finishes within an adjustable timeout

Restarting etcd is usually very expensive and should be done only if etcd is permanently down anyway.
To achieve that, this PR changes the logic to:

  • call etcdctl endpoint health with a 30s timeout
  • change periodSeconds to 30s (to accommodate the increased timeout)
  • change failureThreshold to 5

This basically means that if etcd fails to get a key within a 30s timeout, 5 times in a row (i.e. over a 2.5-minute window), then we are going to restart it.

This is a significantly stronger condition than the previous one (a 30-second window of >1s get latency), and it avoids restarting etcd on alarms (such as NOSPACE or CORRUPT) where a restart isn't the right behavior, as read-only and delete calls should still work: https://etcd.io/docs/v3.4.0/op-guide/maintenance/#space-quota
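
For illustration, a minimal sketch of the exec-based probe this PR proposes, with the initial 30s values from this description (the timing values are tuned later in the review; {{ etcdctl_certs }} stands for the manifest template's client-TLS flags variable, and the field layout follows the manifest snippets quoted in the review comments below):

    "livenessProbe": {
      "exec": {
        "command": [
          "/bin/sh",
          "-c",
          "exec /usr/local/bin/etcdctl --endpoints=127.0.0.1:{{ port }} {{ etcdctl_certs }} --command-timeout=30s endpoint health"
        ]
      },
      "initialDelaySeconds": {{ liveness_probe_initial_delay }},
      "timeoutSeconds": 30,
      "periodSeconds": 30,
      "failureThreshold": 5
    },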

Which issue(s) this PR fixes:

Fixes #96886

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/hold
Adding a hold as I want to make sure this works as I understood from the docs, i.e. a NOSPACE alarm will not make the livenessProbe fail, while a lack of raft quorum will fail the probe.

/cc @wojtek-t @jpbetz @mm4tt @ptabor

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 3, 2020
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 3, 2020
@k8s-ci-robot
Contributor

@mborsz: GitHub didn't allow me to request PR reviews from the following users: ptabor.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

(the PR description quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/provider/gcp Issues or PRs related to gcp provider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 3, 2020
Member Author

mborsz commented Dec 3, 2020

Yes, a NOSPACE alarm doesn't affect etcdctl endpoint health:

➜  ~ /usr/local/bin/etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 1.878945ms
➜  ~ /usr/local/bin/etcdctl alarm list
memberID:10276657743932975437 alarm:NOSPACE
➜  ~  /usr/local/bin/etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 1.773708ms

Member Author

mborsz commented Dec 3, 2020

And lack of quorum does fail the health check - after stopping 2 out of 3 replicas, I see the following in kubelet's logs:

ExecSync b38ec0c1b518287415f1f0f74bda410b409b51ef991701001ee6298ef9add7b0 '/usr/local/bin/etcdctl --endpoints=127.0.0.1:4002 --command-timeout=30s endpoint health' from runtime service failed: rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 30s exceeded: context deadline exceeded

While it's not clear what should happen in this case (lack of quorum), since there may be many different potential causes, this PR doesn't change the behavior significantly: it will still restart this member, but it will wait about 2 more minutes, as the problem may be caused by a network connectivity issue or by another etcd member being down, in which case restarting this member may not be the best way to resolve it.

Member Author

mborsz commented Dec 3, 2020

/hold cancel

I think it's ready for review -- I'm happy to discuss potential timeout values here. The reasoning behind 30s (a very high timeout) is to kill etcd only if it has stopped responding to queries completely (i.e. we set it high enough that if it doesn't respond within 30s, we have no hope that it will respond at all).

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2020
Member Author

mborsz commented Dec 3, 2020

/cc @ptabor

@k8s-ci-robot
Contributor

@mborsz: GitHub didn't allow me to request PR reviews from the following users: ptabor.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @ptabor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member Author

mborsz commented Dec 3, 2020

The test errors are for the "main etcd", where we try to use etcd_livenessprobe_port, which exposes only metrics. I will change this to use the client port (which will likely also solve some mTLS cert issues).

Member Author

mborsz commented Dec 3, 2020

I changed it to use the right port and mTLS certs where necessary. I hope it will work now.

Member Author

mborsz commented Dec 3, 2020

/retest

"path": "/health"
"exec": {
"command": [
"/usr/local/bin/etcdctl",
Member

This depends on the fact that etcdctl is installed in this image.
It's true:
https://github.com/kubernetes/kubernetes/blob/master/cluster/images/etcd/Dockerfile#L32

But maybe it would be worth adding a comment?

"command": [
"/bin/sh",
"-c",
"exec /usr/local/bin/etcdctl --endpoints=127.0.0.1:{{ port }} {{ etcdctl_cerds }} --command-timeout=30s endpoint health"
Member

The 30s timeout seems pretty high to me.

Given that processing requests is generally subsecond, that we have concurrent reads (which don't block us for a long time), etc., I would actually try to go lower than that.

Is it blocked by defrag, as an example?
Or are you trying to accommodate only for "overload"? (If the latter, there is a 5k in-flight limit anyway, so you will get a 429 or whatever that is in such a case.) There is still the case of CPU starvation though...

Member Author

The reasoning behind 30s is that it's high enough that if we don't receive an answer by that time, it's unlikely we will ever receive one.

In current 5k performance tests we are seeing latencies of ~5s when etcd is overloaded, so I wanted to set the threshold significantly higher to avoid killing etcd even if it's more overloaded than that (if we are seeing successful responses with 5s latency, I can imagine that in some other overload scenario we would see e.g. 10s latency).

It is blocked by defrag (it was before as well): https://etcd.io/docs/v3.4.0/op-guide/maintenance/#defragmentation

What value are you proposing instead?

Member

I was thinking about 5-10s. But given what you wrote above, 5s is definitely too low.

How about 15s? If we're seeing 5s in overloaded cases, the 3x margin seems relatively safe, maybe?

Member Author

15s sounds reasonable to me. I will change that

cluster/gce/manifests/etcd.manifest (resolved)
"timeoutSeconds": 15
"timeoutSeconds": 30,
"periodSeconds": 30,
"failureThreshold": 5
Member

5*30s is pretty long. I think either the period & timeout or the threshold should be lower...

Member Author

Let's find a correct timeout first, then we will adjust the threshold or period.

Member Author

I will reduce the period to 15 to match the timeout and keep the threshold at 5, which should translate to 75 seconds. Is that OK with you?

@k8s-ci-robot k8s-ci-robot added area/release-eng Issues or PRs related to the Release Engineering subproject sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Dec 3, 2020
@@ -1717,7 +1717,8 @@ function prepare-etcd-manifest {
local etcd_apiserver_creds="${ETCD_APISERVER_CREDS:-}"
local etcd_extra_args="${ETCD_EXTRA_ARGS:-}"
local suffix="$1"
local etcd_livenessprobe_port="$2"
local etcd_listen_metrics_port="$2"
local etcdctl_cerds=""
Contributor

Should this be 'certs' ?

Contributor

It got fixed.

@fedebongio
Contributor

/assign @jingyih
/triage accepted
/cc @deads2k

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Dec 3, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 3, 2020
},
"initialDelaySeconds": {{ liveness_probe_initial_delay }},
"timeoutSeconds": 15
"timeoutSeconds": 15,
"periodSeconds": 15,
Contributor

I'm not sure I understand the motivation for increasing periodSeconds (10s -> 15s). I'd propose the opposite: let's lower it to accommodate the increased failureThreshold (3 -> 5). Would you mind explaining your reasoning?

Member Author

I'm trying to avoid overlapping probe attempts, i.e. not starting the next probe before the previous one has finished. Since I'm increasing the timeout, I need to increase the probe interval as well.

Separate, independent probe attempts give us a better signal than overlapping ones.

Contributor

If the probe is as expensive as an individual read, I think we could afford overlapping ones.

Contributor

I think it doesn't work this way. It's not possible to have overlapping probes; see the code -

probeLoop:
    for w.doProbe() {
        // Wait for next probe tick.
        select {
        case <-w.stopCh:
            break probeLoop
        case <-probeTicker.C:
            // continue
        }
    }

Member Author

As discussed offline, I changed this to 5s.

For posterity, the reason for that: based on our experiments, etcdctl endpoint health blocks for the full timeoutSeconds in most unhealthy scenarios, and in that case it takes timeoutSeconds * failureThreshold of etcd unhealthiness to trigger a restart. The advantage of a smaller periodSeconds is that it reduces the average time to detect the first unhealthy probe (e.g. with periodSeconds=15, if some probe finished at time 0 and etcd became unhealthy at time 1, we would have to wait another 14 seconds before probing it again, while with periodSeconds=5 we start probing at t=5).
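
For reference, a sketch of the probe timing fields after the adjustments agreed in this thread (timeoutSeconds=15 from the earlier discussion, periodSeconds=5 as above, failureThreshold=5); the authoritative values are the ones in cluster/gce/manifests/etcd.manifest:

    "initialDelaySeconds": {{ liveness_probe_initial_delay }},
    "timeoutSeconds": 15,
    "periodSeconds": 5,
    "failureThreshold": 5

With these values, and given that probes cannot overlap (see the kubelet code quoted above), a persistently blocked etcd needs roughly failureThreshold * timeoutSeconds = 5 * 15s = 75s of consecutive probe failures before the container is restarted.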

Contributor

SGTM

Member Author

mborsz commented Dec 8, 2020

/retest

Member Author

mborsz commented Dec 8, 2020

I think it's ready for review.

@wojtek-t @jingyih WDYT?

Member Author

mborsz commented Dec 11, 2020

/retest

Contributor

@ptabor ptabor left a comment


Looks good. Thank you.

cluster/gce/manifests/etcd.manifest (outdated, resolved)
Change-Id: Ie19c844050c75e3d1c4b431d09ba0ac851c5317b
@wojtek-t
Member

/lgtm

Approving also based on @ptabor's lgtm above.

/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 11, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mborsz, ptabor, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels

  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/provider/gcp Issues or PRs related to gcp provider
  • area/release-eng Issues or PRs related to the Release Engineering subproject
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/bug Categorizes issue or PR as related to a bug.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • needs-priority Indicates a PR lacks a `priority/foo` label and requires one.
  • release-note-none Denotes a PR that doesn't merit a release note.
  • sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider.
  • size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
  • triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Revisit etcd liveness probe
7 participants