kubeadm: use non-serializable startup probe for etcd pods #110744

Merged

Member

@neolit123 neolit123 commented Jun 23, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

As per the etcd maintainers' recommendation - startup probes
shouldn't be serialized, while the liveness probes should be.

etcd-io/etcd#14048 (comment)

Which issue(s) this PR fixes:

Fixes kubernetes/kubeadm#2567

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kubeadm: make sure the etcd static pod startup probe uses /health?serializable=false while the liveness probe uses /health?serializable=true&exclude=NOSPACE. The NOSPACE exclusion would allow administrators to address space issues one member at a time.
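
As a rough illustration of the release note above, the two probe paths can be assembled with Go's net/url package. This is only a sketch: the helper name etcdProbePath is hypothetical and is not part of kubeadm; only the resulting paths come from this PR.

```go
package main

import (
	"fmt"
	"net/url"
)

// etcdProbePath is a hypothetical helper (not kubeadm code) that builds the
// etcd /health path for a probe. exclude may be empty.
func etcdProbePath(serializable bool, exclude string) string {
	q := url.Values{}
	q.Set("serializable", fmt.Sprintf("%t", serializable))
	if exclude != "" {
		q.Set("exclude", exclude)
	}
	// Values.Encode sorts by key, so "exclude" precedes "serializable".
	return "/health?" + q.Encode()
}

func main() {
	fmt.Println("startup:", etcdProbePath(false, ""))
	fmt.Println("liveness:", etcdProbePath(true, "NOSPACE"))
}
```

The liveness path comes out as /health?exclude=NOSPACE&serializable=true because Encode sorts keys; a server that parses the query string into a map should not care about the order.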

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the release-note, kind/feature, size/XS, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, and needs-priority labels Jun 23, 2022
@k8s-ci-robot
Contributor

k8s-ci-robot commented Jun 23, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2022
@neolit123
Member Author

neolit123 commented Jun 23, 2022

/triage accepted
/priority important-longterm

/cc @ahrtr

@k8s-ci-robot
Contributor

k8s-ci-robot commented Jun 23, 2022

@neolit123: GitHub didn't allow me to request PR reviews from the following users: ahrtr.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/triage accepted
/priority important-longterm

/cc @ahrtr

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the triage/accepted, priority/important-longterm, area/kubeadm, and sig/cluster-lifecycle labels and removed the needs-triage, needs-priority, and do-not-merge/needs-sig labels Jun 23, 2022
@k8s-ci-robot k8s-ci-robot requested review from pacoxu and RA489 Jun 23, 2022
@neolit123
Member Author

neolit123 commented Jun 23, 2022

@ahrtr looks like you are not a kubernetes member yet, but please LGTM formally in a comment if you can.
@pacoxu PTAL for /lgtm label.

thanks

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2022
@ahrtr
Contributor

ahrtr commented Jun 23, 2022

Looks good to me, but probably you need to add at least one test case?

@ahrtr
Contributor

ahrtr commented Jun 23, 2022

/lgtm

@k8s-ci-robot
Contributor

k8s-ci-robot commented Jun 23, 2022

@ahrtr: changing LGTM is restricted to collaborators

In response to this:

/lgtm

@pacoxu
Member

pacoxu commented Jun 24, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 24, 2022
@pacoxu
Member

pacoxu commented Jun 24, 2022

I checked the cluster etcd configuration in cluster/gce/manifests/etcd.manifest.

#97034 changed the liveness probe to use etcdctl endpoint health.

"livenessProbe": {
  "exec": {
    "command": [
      "/bin/sh",
      "-c",
      "set -x; exec /usr/local/bin/etcdctl --endpoints=127.0.0.1:{{ port }} {{ etcdctl_certs }} --command-timeout=15s endpoint health"
    ]
  },
  "initialDelaySeconds": {{ liveness_probe_initial_delay }},
  "timeoutSeconds": 15,
  "periodSeconds": 5,
  "failureThreshold": 5
},

Without this PR, etcd's current livenessProbe uses the /health endpoint, which fails if any of the following conditions is met (src):

  • there is an active alarm (there are two kinds of alarm: NOSPACE and CORRUPT)
  • there is no raft leader
  • the latency of a QGET request exceeds 1s

The problem is that in most of those cases, restarting etcd isn't the right behavior:

  • if there is NOSPACE alarm, the restart will not free that space
  • if there is no raft leader
  • if the request latency > 1s, the etcd cluster is overloaded which is bad, but restart will generate even more load

The new livenessProbe, etcdctl endpoint health, checks the following condition (src):

  • a linearized Get (that is, one that goes through quorum) finishes within a configurable timeout

Does /health?serializable=true check alarms like NOSPACE?

/health?serializable=true, so that K8s only restarts a POD when the local etcd member isn't healthy.

@pacoxu
Member

pacoxu commented Jun 24, 2022

[root@paco ~]# curl 127.0.0.1:2381/health
{"health":"false","reason":"ALARM NOSPACE"}
[root@paco ~]# curl 127.0.0.1:2381/health?serializable=true
{"health":"false","reason":"ALARM NOSPACE"}
[root@paco ~]# curl 127.0.0.1:2381/health?serializable=false
{"health":"false","reason":"ALARM NOSPACE"}

For the NOSPACE alarm, /health keeps its behavior as #97034 mentions.
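
For anyone scripting against this endpoint, here is a minimal Go sketch of decoding the body shown in the curl output above. The struct and function names are made up for illustration; the only thing taken from the output is the two-field JSON shape, where "health" is a string rather than a boolean.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthResponse mirrors the two fields visible in the curl output above.
// Note that "health" is a JSON string ("true"/"false"), not a boolean.
type healthResponse struct {
	Health string `json:"health"`
	Reason string `json:"reason"`
}

// parseHealth decodes a /health body and reports whether the member is healthy.
func parseHealth(body []byte) (healthy bool, reason string, err error) {
	var h healthResponse
	if err := json.Unmarshal(body, &h); err != nil {
		return false, "", err
	}
	return h.Health == "true", h.Reason, nil
}

func main() {
	healthy, reason, err := parseHealth([]byte(`{"health":"false","reason":"ALARM NOSPACE"}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("healthy=%v reason=%q\n", healthy, reason)
}
```

Note that a Kubernetes HTTP probe does not read this body; it only looks at the status code, so the body is mainly useful for humans and monitoring.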

@ahrtr
Contributor

ahrtr commented Jun 24, 2022

The benefit of using etcdctl is that users can specify certificates, but it seems that the [startup|liveness]Probes do not support it.

@ahrtr
Contributor

ahrtr commented Jun 24, 2022

> [root@paco ~]# curl 127.0.0.1:2381/health
> {"health":"false","reason":"ALARM NOSPACE"}
> [root@paco ~]# curl 127.0.0.1:2381/health?serializable=true
> {"health":"false","reason":"ALARM NOSPACE"}
> [root@paco ~]# curl 127.0.0.1:2381/health?serializable=false
> {"health":"false","reason":"ALARM NOSPACE"}
>
> For the NOSPACE alarm, /health keeps its behavior as #97034 mentions.

Yes, this behavior is unchanged. Users can exclude specific alarms with the exclude parameter; see health.go#L125

@@ -204,8 +203,8 @@ func GetEtcdPodSpec(cfg *kubeadmapi.ClusterConfiguration, endpoint *kubeadmapi.A
 v1.ResourceMemory: resource.MustParse("100Mi"),
 },
 },
-LivenessProbe: staticpodutil.LivenessProbe(probeHostname, etcdHealthEndpoint, probePort, probeScheme),
-StartupProbe: staticpodutil.StartupProbe(probeHostname, etcdHealthEndpoint, probePort, probeScheme, cfg.APIServer.TimeoutForControlPlane),
+LivenessProbe: staticpodutil.LivenessProbe(probeHostname, "/health?serializable=true", probePort, probeScheme),
Member

@pacoxu pacoxu Jun 24, 2022

Suggested change
-LivenessProbe: staticpodutil.LivenessProbe(probeHostname, "/health?serializable=true", probePort, probeScheme),
+LivenessProbe: staticpodutil.LivenessProbe(probeHostname, "/health?exclude=NOSPACE&serializable=true", probePort, probeScheme),

@ahrtr so /health?exclude=NOSPACE&serializable=true would be better here.

Member

@pacoxu pacoxu Jun 27, 2022

I am not sure whether we need the \ before the &.

curl -i -v -G http://127.0.0.1:2381/health?serializable=false\&exclude=NOSPACE
[root@paco ~]# curl -i -v -G http://127.0.0.1:2381/health?serializable=false\&exclude=NOSPACE
*   Trying 127.0.0.1:2381...
* Connected to 127.0.0.1 (127.0.0.1) port 2381 (#0)
> GET /health?serializable=false&exclude=NOSPACE HTTP/1.1
> Host: 127.0.0.1:2381
> User-Agent: curl/7.61.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Mon, 27 Jun 2022 10:31:36 GMT
Date: Mon, 27 Jun 2022 10:31:36 GMT
< Content-Length: 29
Content-Length: 29
< Content-Type: text/plain; charset=utf-8
Content-Type: text/plain; charset=utf-8

<
* Connection #0 to host 127.0.0.1 left intact
{"health":"true","reason":""}[root@paco ~]#
[root@paco ~]#
[root@paco ~]# curl -i -v -G http://127.0.0.1:2381/health?serializable=false&exclude=NOSPACE
[1] 27110
[root@paco ~]# *   Trying 127.0.0.1:2381...
* Connected to 127.0.0.1 (127.0.0.1) port 2381 (#0)
> GET /health?serializable=false HTTP/1.1
> Host: 127.0.0.1:2381
> User-Agent: curl/7.61.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
HTTP/1.1 503 Service Unavailable
< Content-Type: text/plain; charset=utf-8
Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
X-Content-Type-Options: nosniff
< Date: Mon, 27 Jun 2022 10:31:41 GMT
Date: Mon, 27 Jun 2022 10:31:41 GMT
< Content-Length: 44
Content-Length: 44

<
{"health":"false","reason":"ALARM NOSPACE"}
* Connection #0 to host 127.0.0.1 left intact 

Without the \, the shell treats & as the background operator (note the "[1] 27110" job line above), so curl sends only the first param.

Contributor

@ahrtr ahrtr Jun 27, 2022

I don't think we need the ' \ ' here. The library handles it automatically, so there is no need to worry about it.

Please try the URL in a browser; the browser handles it in the same way as the Go library does.
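
To make that concrete, here is a small sketch using only Go's standard net/url package (the function name queryParams is just for illustration). The URL is an ordinary Go string that never passes through a shell, so the & needs no escaping and both parameters survive parsing.

```go
package main

import (
	"fmt"
	"net/url"
)

// queryParams parses a raw URL and returns its decoded query parameters.
func queryParams(raw string) (url.Values, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	return u.Query(), nil
}

func main() {
	// No backslash is needed: the string never passes through a shell.
	q, err := queryParams("http://127.0.0.1:2381/health?exclude=NOSPACE&serializable=true")
	if err != nil {
		panic(err)
	}
	fmt.Println("exclude:", q.Get("exclude"))
	fmt.Println("serializable:", q.Get("serializable"))
}
```

The backslash in the curl test is purely a shell concern; quoting the whole URL on the command line would work as well.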

Member

@pacoxu pacoxu Jun 27, 2022

While testing, I noticed one thing: etcd is not killed again after the first kill.

Previously, with a NOSPACE alarm, the pod's liveness probe on /health would fail and the etcd pod would be restarted.
After that restart the startup probe never passes, so the liveness probe is never triggered again.

The pod keeps running, as shown below:

[root@paco ~]# kubectl get events  -n kube-system   -w| grep etcd
13m         Warning   Unhealthy        pod/etcd-paco                   Startup probe failed: HTTP probe failed with statuscode: 503
12m         Normal    Killing          pod/etcd-paco                   Stopping container etcd

[root@paco ~]# kubectl get pod  -n kube-system  etcd-paco
NAME        READY   STATUS    RESTARTS   AGE
etcd-paco   1/1     Running   0          13m

Member

@pacoxu pacoxu Jun 27, 2022

So the difference is:

  • with exclude=NOSPACE: no restart on a NOSPACE alarm
  • without exclude=NOSPACE: one restart

Contributor

@ahrtr ahrtr Jun 28, 2022

This is the expected behavior for /health?exclude=NOSPACE&serializable=true. Which option is better depends on the behavior you want.

With exclude=NOSPACE

  • Good side: etcd can still serve range/read requests. An administrator can fix the space issue one member at a time, and the etcd cluster keeps serving range/read requests throughout the process.
  • Bad side: The member (or even the whole cluster) is actually unhealthy. Users may not notice unless there is automated monitoring, such as Prometheus.

Without exclude=NOSPACE

  • Good side: The etcd member will be restarted by the liveness probe and will then never pass the startup probe, so it gets immediate attention from the administrator.
  • Bad side: It can't serve even range/read requests. Any system that depends on the etcd cluster will be completely down.
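
The trade-off above can be sketched with a toy model. The handler below is NOT etcd's implementation; it is a simplified stand-in, written under the assumption that an active alarm yields a 503 unless it is listed in exclude, which matches the curl output earlier in this thread. The probe check uses the rule that an HTTP probe passes on any status code from 200 up to, but not including, 400. All names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// toyHealthHandler is a simplified stand-in for etcd's /health handler, written
// only to illustrate the trade-off; it is NOT etcd's implementation. It
// reports 503 when an alarm is active unless that alarm is listed in "exclude".
func toyHealthHandler(activeAlarm string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		excluded := false
		for _, e := range r.URL.Query()["exclude"] {
			if e == activeAlarm {
				excluded = true
			}
		}
		if activeAlarm != "" && !excluded {
			body := `{"health":"false","reason":"ALARM ` + activeAlarm + `"}`
			http.Error(w, body, http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, `{"health":"true","reason":""}`)
	})
}

// probePasses sends a GET for path to h in-process and applies the HTTP probe
// rule: any status code in [200, 400) counts as success.
func probePasses(h http.Handler, path string) bool {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code >= 200 && rec.Code < 400
}

func main() {
	h := toyHealthHandler("NOSPACE") // pretend the NOSPACE alarm is raised
	for _, path := range []string{
		"/health?serializable=true",
		"/health?exclude=NOSPACE&serializable=true",
	} {
		fmt.Printf("%s pass=%v\n", path, probePasses(h, path))
	}
}
```

In this model the first path fails the probe (which would lead to a restart) and the second passes (the member keeps serving reads while the alarm is handled elsewhere).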

Proposal

How about exposing this configuration to users, with exclude=NOSPACE as the default?

Member Author

@neolit123 neolit123 Jun 30, 2022

> How about exposing this configuration to users, with exclude=NOSPACE as the default?

users are already able to customize the probes with static pod manifests patches:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/control-plane-flags/#patches
so i guess the discussion here is leading towards having the suggestion as here:
https://github.com/kubernetes/kubernetes/pull/110744/files#r905816216
by default.

i will update the PR.

As per the etcd maintainers' recommendation - startup probes
shouldn't be serialized, while the liveness probes should be.
@neolit123 neolit123 force-pushed the 1.25-update-etcd-startup-probe branch from e71cc7c to 2829fc0 Compare Jun 30, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 30, 2022
@neolit123
Member Author

neolit123 commented Jun 30, 2022

/hold cancel
updated.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 30, 2022
@pacoxu
Member

pacoxu commented Jul 1, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2022
@k8s-ci-robot k8s-ci-robot merged commit 8b02217 into kubernetes:master Jul 1, 2022
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jul 1, 2022
@anguslees
Member

anguslees commented Nov 2, 2022

Q: Can someone help me understand why startupProbe (serializable=false) != livenessProbe (serializable=true) ?

I've read the background etcd comment, which says:

... so that the liveness probe is disabled until the etcd cluster is healthy and working. This is for the startup phase of the etcd cluster.

I think the usual k8s pattern suggests we use readinessProbe to prevent an etcd member from being used until it is ready. When readinessProbe uses serializable=false, I think that verifies pod has exited "startup phase" before clients use it. Why is serializable=false startupProbe also useful/necessary?

To be clear, I think this startupProbe is mostly harmless too, and I don't have an actual objection to it. I think startupProbe serializable=false means an etcd pod in an unhealthy cluster will restart a few extra times when startupProbe eventually times out (compared to startupProbe serializable=true, which would transition to live=true ready=false). I think the extra restarts might extend the recovery time of the cluster, but won't actually prevent it recovering. Iiuc.

@neolit123
Member Author

neolit123 commented Nov 2, 2022

on this PR and on the issue kubernetes/kubeadm#2567 we had all the discussion on this topic.

@ahrtr might be able to provide additional comments.

Successfully merging this pull request may close these issues.

Improvements for etcd liveness probes
5 participants