
Use etcdctl endpoint health as a etcd's livenessProbe #97034

Merged
merged 1 commit on Dec 11, 2020

Conversation

Member

@mborsz mborsz commented Dec 3, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:
Without this PR, etcd's current livenessProbe uses the /health endpoint, which fails if any of the following conditions is met (src):

  • there is an active alarm (there are two kinds of alarm: NOSPACE and CORRUPT)
  • there is no raft leader
  • the latency of a QGET request exceeds 1s

The problem is that in most of these cases, restarting etcd isn't the right behavior:

  • if there is a NOSPACE alarm, the restart will not free that space
  • if there is no raft leader, restarting this member won't bring a leader back (the cause usually lies with another member or the network)
  • if the request latency is > 1s, the etcd cluster is overloaded, which is bad, but a restart will generate even more load
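
For context, the pre-change probe is an httpGet check against that /health endpoint; a rough sketch of its shape, reconstructed from the manifest snippets quoted later in this review (the template variable names are illustrative, not verbatim):

    "livenessProbe": {
      "httpGet": {
        "host": "127.0.0.1",
        "port": {{ etcd_livenessprobe_port }},
        "path": "/health"
      },
      "initialDelaySeconds": {{ liveness_probe_initial_delay }},
      "timeoutSeconds": 15
    },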

The new livenessProbe, etcdctl endpoint health, checks the following condition (src):

  • whether a linearized (i.e. quorum-based) Get finishes within an adjustable timeout

Restarting etcd is usually very expensive and should be done only if etcd is permanently down anyway.
To achieve that, this PR changes the logic to:

  • call etcdctl endpoint health with a 30s timeout
  • change periodSeconds to 30s (to accommodate the increased timeout)
  • change failureThreshold to 5

This basically means that if etcd fails to get a key within a 30s timeout, 5 times in a row (i.e. over a 2.5-minute window), then we are going to restart it.

This is a significantly stronger condition than the previous one (a 30-second window of >1s get latency), and it avoids restarting etcd on alarms (such as NOSPACE or CORRUPT) where a restart isn't the right behavior, as read-only and delete calls should still work: https://etcd.io/docs/v3.4.0/op-guide/maintenance/#space-quota
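
For illustration, a minimal sketch of the exec-based probe this PR proposes, with the initial 30s values from this description (the timing values are tuned later in the review; {{ etcdctl_certs }} stands for the manifest template's client-TLS flags variable, and the field layout follows the manifest snippets quoted in the review comments below):

    "livenessProbe": {
      "exec": {
        "command": [
          "/bin/sh",
          "-c",
          "exec /usr/local/bin/etcdctl --endpoints=127.0.0.1:{{ port }} {{ etcdctl_certs }} --command-timeout=30s endpoint health"
        ]
      },
      "initialDelaySeconds": {{ liveness_probe_initial_delay }},
      "timeoutSeconds": 30,
      "periodSeconds": 30,
      "failureThreshold": 5
    },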

Which issue(s) this PR fixes:

Fixes #96886

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/hold
Adding a hold as I want to make sure this works as I understood from the docs, i.e. a NOSPACE alarm will not make the livenessProbe fail, while a lack of raft quorum will fail the probe.

/cc @wojtek-t @jpbetz @mm4tt @ptabor

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 3, 2020
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 3, 2020
@k8s-ci-robot
Contributor

@mborsz: GitHub didn't allow me to request PR reviews from the following users: ptabor.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

(the PR description quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/provider/gcp Issues or PRs related to gcp provider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 3, 2020
Member Author

mborsz commented Dec 3, 2020

Yes, a NOSPACE alarm doesn't affect etcdctl endpoint health:

➜  ~ /usr/local/bin/etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 1.878945ms
➜  ~ /usr/local/bin/etcdctl alarm list
memberID:10276657743932975437 alarm:NOSPACE
➜  ~  /usr/local/bin/etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 1.773708ms

Member Author

mborsz commented Dec 3, 2020

And lack of quorum does fail the health check - after stopping 2 out of 3 replicas, I see the following in kubelet's logs:

ExecSync b38ec0c1b518287415f1f0f74bda410b409b51ef991701001ee6298ef9add7b0 '/usr/local/bin/etcdctl --endpoints=127.0.0.1:4002 --command-timeout=30s endpoint health' from runtime service failed: rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 30s exceeded: context deadline exceeded

While it's not clear what should happen in this case (lack of quorum), since there may be many different potential causes, this PR doesn't change the behavior significantly: it will still restart this member, but it will wait about 2 more minutes, as the problem may be caused by a network connectivity issue or by another etcd member being down, in which case restarting this member may not be the best way to resolve it.

Member Author

mborsz commented Dec 3, 2020

/hold cancel

I think it's ready for review -- I'm happy to discuss potential timeout values here. The reasoning behind 30s (a very high timeout) is to kill etcd only if it has stopped responding to queries completely (i.e. we set it high enough that if it doesn't respond within 30s, we have no hope that it will respond at all).

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2020
Member Author

mborsz commented Dec 3, 2020

/cc @ptabor

@k8s-ci-robot
Contributor

@mborsz: GitHub didn't allow me to request PR reviews from the following users: ptabor.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @ptabor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member Author

mborsz commented Dec 3, 2020

The test errors are for the "main etcd", where we try to use etcd_livenessprobe_port, which exposes only metrics. I will change this to use the client port (which will likely also solve some mTLS cert issues).

Member Author

mborsz commented Dec 3, 2020

I changed it to use the right port and mTLS certs where necessary. I hope it will work now.

Member Author

mborsz commented Dec 3, 2020

/retest

"path": "/health"
"exec": {
"command": [
"/usr/local/bin/etcdctl",
Member

This depends on the fact that etcdctl is installed in this image.
It's true:
https://github.com/kubernetes/kubernetes/blob/master/cluster/images/etcd/Dockerfile#L32

But maybe it would be worth adding a comment?

"command": [
"/bin/sh",
"-c",
"exec /usr/local/bin/etcdctl --endpoints=127.0.0.1:{{ port }} {{ etcdctl_cerds }} --command-timeout=30s endpoint health"
Member

The 30s timeout seems pretty high to me.

Given that processing requests is generally subsecond, that we have concurrent reads (which don't block us for a long time), etc., I would actually try to go lower than that.

Is it blocked by defrag, as an example?
Or are you trying to accommodate only for "overload"? (If the latter, there is a 5k in-flight limit anyway, so you will get a 429 or whatever that is in such a case.) There is still the case of CPU starvation though...

Member Author

The reasoning behind 30s is that it's high enough that if we don't receive an answer by that time, it's unlikely we will ever receive one.

In current 5k performance tests we are seeing latencies of ~5s when etcd is overloaded, so I wanted to set the threshold significantly higher to avoid killing etcd even if it's more overloaded than that (if we are seeing successful responses with 5s latency, I can imagine that in some other overload scenario we would see e.g. 10s latency).

It is blocked by defrag (it was before as well): https://etcd.io/docs/v3.4.0/op-guide/maintenance/#defragmentation

What value are you proposing instead?

Member

I was thinking about 5-10s. But given what you wrote above, 5s is definitely too low.

How about 15s? If we're seeing 5s in overloaded cases, the 3x margin seems relatively safe, maybe?

Member Author

15s sounds reasonable to me. I will change that

cluster/gce/manifests/etcd.manifest (resolved)
"timeoutSeconds": 15
"timeoutSeconds": 30,
"periodSeconds": 30,
"failureThreshold": 5
Member

5*30s is pretty long. I think either the period & timeout or the threshold should be lower...

Member Author

Let's find a correct timeout first, then we will adjust the threshold or period.

Member Author

I will reduce the period to 15 to match the timeout and keep the threshold at 5, which should translate to 75 seconds. Is that OK with you?

@k8s-ci-robot k8s-ci-robot added area/release-eng Issues or PRs related to the Release Engineering subproject sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Dec 3, 2020
@@ -1717,7 +1717,8 @@ function prepare-etcd-manifest {
local etcd_apiserver_creds="${ETCD_APISERVER_CREDS:-}"
local etcd_extra_args="${ETCD_EXTRA_ARGS:-}"
local suffix="$1"
local etcd_livenessprobe_port="$2"
local etcd_listen_metrics_port="$2"
local etcdctl_cerds=""
Contributor

Should this be 'certs' ?

Contributor

It got fixed.

@fedebongio
Contributor

/assign @jingyih
/triage accepted
/cc @deads2k

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Dec 3, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 3, 2020
},
"initialDelaySeconds": {{ liveness_probe_initial_delay }},
"timeoutSeconds": 15
"timeoutSeconds": 15,
"periodSeconds": 15,
Contributor

I'm not sure I understand the motivation for increasing periodSeconds (10s -> 15s). I'd propose the opposite: let's lower it to accommodate the increased failureThreshold (3 -> 5). Would you mind explaining your reasoning?

Member Author

I'm trying to avoid overlapping probe attempts, i.e. not starting the next probe before the previous one has finished. Since I'm increasing the timeout, I need to increase the probe interval as well.

Separate, independent probe attempts give us a better signal than overlapping ones.

Contributor

If the probe is as expensive as an individual read, I think we could afford overlapping ones.

Contributor

I think it doesn't work this way. It's not possible to have overlapping probes; see the code -

probeLoop:
    for w.doProbe() {
        // Wait for next probe tick.
        select {
        case <-w.stopCh:
            break probeLoop
        case <-probeTicker.C:
            // continue
        }
    }

Member Author

As discussed offline, I changed this to 5s.

For posterity, the reason for that: based on our experiments, etcdctl endpoint health blocks for the full timeoutSeconds in most unhealthy scenarios, and in that case it takes timeoutSeconds * failureThreshold of etcd unhealthiness to trigger a restart. The advantage of a smaller periodSeconds is that it reduces the average time to detect the first unhealthy probe (e.g. with periodSeconds=15, if some probe finished at time 0 and etcd became unhealthy at time 1, we would have to wait another 14 seconds before probing it again, while with periodSeconds=5 we start probing at t=5).
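
For reference, a sketch of the probe timing fields after the adjustments agreed in this thread (timeoutSeconds=15 from the earlier discussion, periodSeconds=5 as above, failureThreshold=5); the authoritative values are the ones in cluster/gce/manifests/etcd.manifest:

    "initialDelaySeconds": {{ liveness_probe_initial_delay }},
    "timeoutSeconds": 15,
    "periodSeconds": 5,
    "failureThreshold": 5

With these values, and given that probes cannot overlap (see the kubelet code quoted above), a persistently blocked etcd needs roughly failureThreshold * timeoutSeconds = 5 * 15s = 75s of consecutive probe failures before the container is restarted.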

Contributor

SGTM

Member Author

mborsz commented Dec 8, 2020

/retest

Member Author

mborsz commented Dec 8, 2020

I think it's ready for review.

@wojtek-t @jingyih WDYT?

Member Author

mborsz commented Dec 11, 2020

/retest

Contributor

@ptabor ptabor left a comment


Looks good. Thank you.

cluster/gce/manifests/etcd.manifest (outdated, resolved)
Change-Id: Ie19c844050c75e3d1c4b431d09ba0ac851c5317b
@wojtek-t
Member

/lgtm

Approving also based on @ptabor's lgtm above.

/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 11, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mborsz, ptabor, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels

  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/provider/gcp Issues or PRs related to gcp provider
  • area/release-eng Issues or PRs related to the Release Engineering subproject
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/bug Categorizes issue or PR as related to a bug.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • needs-priority Indicates a PR lacks a `priority/foo` label and requires one.
  • release-note-none Denotes a PR that doesn't merit a release note.
  • sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider.
  • size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
  • triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Revisit etcd liveness probe
7 participants