Use etcdctl endpoint health as etcd's livenessProbe #97034
Conversation
Yes, the NOSPACE alarm doesn't affect etcdctl endpoint health:
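A quick way to sanity-check this on a test member might look like the sketch below (the endpoint address is a placeholder, and it assumes a member that already has the NOSPACE alarm raised):

# Hypothetical spot-check on a member with NOSPACE already raised:
etcdctl --endpoints=127.0.0.1:2379 alarm list        # expect something like: memberID:... alarm:NOSPACE
etcdctl --endpoints=127.0.0.1:2379 endpoint health   # still reports the endpoint as healthy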
And a lack of quorum does fail the health check. After stopping 2 out of 3 replicas I see this in kubelet's logs:
While it's not clear what should happen in this case (lack of quorum), since there may be many different causes, this PR doesn't change that behavior significantly: it will still restart this member, but it will wait roughly 2 more minutes, as this can be caused by a network connectivity issue or by another etcd member being down, in which case restarting this member may not be the best way to resolve it.
/hold cancel I think it's ready for review -- I'm happy to discuss the potential timeout values here. The reasoning behind 30s (a very high timeout) is to kill etcd only if it has stopped responding to queries completely (i.e. we set this high enough that if it doesn't respond within 30s, there is no hope it will respond at all).
/cc @ptabor
@mborsz: GitHub didn't allow me to request PR reviews from the following users: ptabor. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The test errors are for "main etcd", where we try to use etcd_livenessprobe_port, which exposes only metrics. I will change this to use the client port (likely also solving some mTLS cert issues).
I changed it to use the right port and mTLS certs when necessary. I hope it will work now.
/retest
cluster/gce/manifests/etcd.manifest
Outdated
"path": "/health" | ||
"exec": { | ||
"command": [ | ||
"/usr/local/bin/etcdctl", |
This depends on the fact that etcdctl is installed in this image.
It's true:
https://github.com/kubernetes/kubernetes/blob/master/cluster/images/etcd/Dockerfile#L32
But maybe it would be worth adding a comment?
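A quick local spot-check could look like the sketch below (the image tag is only an example, not necessarily what this manifest pins):

# Hypothetical check that etcdctl ships in the etcd image (tag is an example):
docker run --rm --entrypoint /usr/local/bin/etcdctl k8s.gcr.io/etcd:3.4.13-0 version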
cluster/gce/manifests/etcd.manifest
Outdated
"command": [ | ||
"/bin/sh", | ||
"-c", | ||
"exec /usr/local/bin/etcdctl --endpoints=127.0.0.1:{{ port }} {{ etcdctl_cerds }} --command-timeout=30s endpoint health" |
The 30s timeout seems pretty high to me.
Given that processing a request is generally sub-second, that we have concurrent reads (which don't block us for long), etc., I would actually try to go lower than that.
Is it blocked by defrag, for example?
Or are you trying to accommodate only "overload"? (If the latter, there is a 5k in-flight limit anyway, so you will get a 429 or whatever that is in such a case.) There is still the case of CPU starvation though...
The reasoning behind 30s is that it's high enough that if we don't receive an answer by then, it's unlikely we will ever receive one.
In the current 5k performance tests we are seeing latencies of ~5s when etcd is overloaded, so I wanted to set the threshold significantly higher to avoid killing etcd even if it's more overloaded than that (if we see successful responses with 5s latency, I can imagine some other overload scenario where we would see e.g. 10s latency).
It is blocked by defrag (it was before this change too): https://etcd.io/docs/v3.4.0/op-guide/maintenance/#defragmentation
What value are you proposing instead?
I was thinking about 5-10s, but given what you wrote above, 5s is definitely too low.
How about 15s? If we're seeing 5s in overloaded cases, a 3x margin seems relatively safe, maybe?
15s sounds reasonable to me. I will change that
"timeoutSeconds": 15 | ||
"timeoutSeconds": 30, | ||
"periodSeconds": 30, | ||
"failureThreshold": 5 |
5*30s is pretty long. I think either the period & timeout or the threshold should be lower...
Let's find the correct timeout first; then we will adjust the threshold or period.
I will reduce the period to 15 to match the timeout and keep the threshold set to 5, which should translate to 75 seconds. Is that OK with you?
cluster/gce/gci/configure-helper.sh
Outdated
@@ -1717,7 +1717,8 @@ function prepare-etcd-manifest {
local etcd_apiserver_creds="${ETCD_APISERVER_CREDS:-}"
local etcd_extra_args="${ETCD_EXTRA_ARGS:-}"
local suffix="$1"
local etcd_livenessprobe_port="$2"
local etcd_listen_metrics_port="$2"
local etcdctl_cerds=""
Should this be 'certs'?
It got fixed.
cluster/gce/manifests/etcd.manifest
Outdated
},
"initialDelaySeconds": {{ liveness_probe_initial_delay }},
"timeoutSeconds": 15
"timeoutSeconds": 15,
"periodSeconds": 15,
I'm not sure I understand the motivation for increasing periodSeconds (10s -> 15s). I'd propose the opposite: let's lower it to accommodate the increased failureThreshold (3 -> 5). Would you mind explaining your reasoning?
I'm trying to avoid overlapping probe attempts, i.e. not starting the next probe if the previous one hasn't finished. Since I'm increasing the timeout, I need to increase the probe interval as well.
Separate, independent probe attempts give us a better signal than overlapping ones.
If the probe is as expensive as an individual read, I think we could afford overlapping ones.
I don't think it works this way. It's not possible to have overlapping probes; see the code:
kubernetes/pkg/kubelet/prober/worker.go
Lines 151 to 160 in 442a69c
for w.doProbe() {
    // Wait for next probe tick.
    select {
    case <-w.stopCh:
        break probeLoop
    case <-probeTicker.C:
        // continue
    }
}
}
As discussed offline, I changed this to 5s.
For posterity, the reasoning: based on our experiments, etcdctl endpoint health blocks for the full timeoutSeconds in most unhealthy scenarios, and in that case it takes timeoutSeconds * failureThreshold of etcd unhealthiness to trigger a restart. The advantage of a smaller periodSeconds is that it reduces the average time to detect the first unhealthy probe (e.g. with periodSeconds=15, if we finished some probe at time 0 and etcd becomes unhealthy at time 1, we need to wait another 14 seconds before we start probing it again, while with periodSeconds=5 we would start probing at t=5).
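A back-of-the-envelope sketch of the resulting timing, assuming the values this thread converged on (timeoutSeconds=15, periodSeconds=5, failureThreshold=5):

# When etcd is unhealthy, each probe blocks for the full timeout, and the kubelet
# never overlaps probes (see worker.go above), so failed attempts run back to back:
echo "restart after ~$((5 * 15))s of continuous unhealthiness"        # failureThreshold * timeoutSeconds = 75s
echo "first probe starts at most 5s after etcd turns unhealthy"       # bounded by periodSeconds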
SGTM
/retest
/retest
Looks good. Thank you.
Change-Id: Ie19c844050c75e3d1c4b431d09ba0ac851c5317b
/lgtm Approving also based on @ptabor's lgtm above. /approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mborsz, ptabor, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Without this PR, etcd's current livenessProbe uses the /health endpoint, which fails if any of these conditions is met (src):
The problem is that in most of those cases, restarting etcd isn't the right behavior:
The new livenessProbe, etcdctl endpoint health, checks the following condition (src):
An etcd restart is usually very expensive and should be done only if etcd is permanently down anyway.
To achieve that, this PR changes the logic to:
This basically means that if etcd fails to get a key with a 30s timeout 5 times in a row (so within a 2.5-minute window), then we restart it.
This is a significantly stronger condition than the previous one (a 30-second window of >1s get latency) and it avoids restarting etcd on alarms (such as NOSPACE or CORRUPT), where a restart isn't the right behavior, as read-only and delete calls should still work: https://etcd.io/docs/v3.4.0/op-guide/maintenance/#space-quota
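To make the difference concrete, a rough sketch of what the old and new probes boil down to; the ports, paths, and cert flags below are placeholders, not the exact values in the manifest:

# Old probe (roughly): an HTTP GET against the /health endpoint
curl -fsS http://127.0.0.1:2381/health        # placeholder port

# New probe (roughly): an actual client read through etcdctl, bounded by a timeout
etcdctl --endpoints=127.0.0.1:2379 \
  --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
  --command-timeout=30s endpoint health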
Which issue(s) this PR fixes:
Fixes #96886
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/hold
Adding a hold as I want to make sure that this works as I understood from the docs, i.e. the NOSPACE alarm will not make the livenessProbe fail, while lack of raft quorum will fail the probe.
/cc @wojtek-t @jpbetz @mm4tt @ptabor