
kubeadm: Detect kubelet readiness and error out if the kubelet is unhealthy #51369

Merged: 1 commit, merged Sep 3, 2017

Conversation

@luxas (Member) commented Aug 25, 2017

What this PR does / why we need it:

To improve the UX when the kubelet is unhealthy or stopped, kubeadm now polls the kubelet's API after 40 and 60 seconds, performing an exponential backoff for a total of 155 seconds.

If the kubelet endpoint is not returning ok by then, kubeadm gives up and exits.

This will mitigate at least 60% of our "[apiclient] Created API client, waiting for control plane to come up" issues in the kubeadm issue tracker 🎉, as kubeadm now informs the user what's wrong and no longer deadlocks like before.
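
For illustration, here is a minimal Go sketch of the kind of delayed, exponentially backed-off health probe described above. The function name probeKubeletHealthz, the retry interval, and the loop structure are assumptions made for this sketch, not the exact helpers or constants kubeadm uses:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// probeKubeletHealthz waits for an initial delay, then polls the kubelet's
// read-only healthz endpoint with exponential backoff until the total budget
// is spent. Names and timings here are illustrative only.
func probeKubeletHealthz(url string, initialDelay, budget time.Duration) error {
    time.Sleep(initialDelay)

    deadline := time.Now().Add(budget)
    interval := 5 * time.Second
    for time.Now().Before(deadline) {
        resp, err := http.Get(url)
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil // the kubelet answered; nothing more to do
            }
        }
        fmt.Println("[kubelet-check] It seems like the kubelet isn't running or healthy.")
        time.Sleep(interval)
        interval *= 2 // exponential backoff between attempts
    }
    return fmt.Errorf("the kubelet never became healthy at %s", url)
}

func main() {
    err := probeKubeletHealthz("http://localhost:10255/healthz", 40*time.Second, 155*time.Second)
    if err != nil {
        fmt.Println("error:", err)
    }
}

In kubeadm itself the equivalent logic sits behind the apiclient.Waiter abstraction quoted later in this review; the sketch only shows the delay-then-backoff shape of the check.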

Demo:

lucas@THEGOPHER:~/luxas/kubernetes$ sudo ./kubeadm init --skip-preflight-checks
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[init] Using Kubernetes version: v1.7.4
[init] Using Authorization modes: [Node RBAC]
[preflight] Skipping pre-flight checks
[kubeadm] WARNING: starting in 1.8, tokens expire after 24 hours by default (if you require a non-expiring token use --token-ttl 0)
[certificates] Generated ca certificate and key.
[certificates] Generated apiserver certificate and key.
[certificates] apiserver serving cert is signed for DNS names [thegopher kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.1.115]
[certificates] Generated apiserver-kubelet-client certificate and key.
[certificates] Generated sa key and public key.
[certificates] Generated front-proxy-ca certificate and key.
[certificates] Generated front-proxy-client certificate and key.
[certificates] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "scheduler.conf"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
[init] Waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests"
[init] This often takes around a minute; or longer if the control plane images have to be pulled.
[apiclient] All control plane components are healthy after 40.502199 seconds
[markmaster] Will mark node thegopher as master by adding a label and a taint
[markmaster] Master thegopher tainted and labelled with key/value: node-role.kubernetes.io/master=""
[bootstraptoken] Using token: 5776d5.91e7ed14f9e274df
[bootstraptoken] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstraptoken] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstraptoken] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[uploadconfig] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[addons] Applied essential addon: kube-dns
[addons] Applied essential addon: kube-proxy

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run (as a regular user):

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  http://kubernetes.io/docs/admin/addons/

You can now join any number of machines by running the following on each node
as root:

  kubeadm join --token 5776d5.91e7ed14f9e274df 192.168.1.115:6443 --discovery-token-ca-cert-hash sha256:6f301ce8c3f5f6558090b2c3599d26d6fc94ffa3c3565ffac952f4f0c7a9b2a9

lucas@THEGOPHER:~/luxas/kubernetes$ sudo ./kubeadm reset
[preflight] Running pre-flight checks
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Removing kubernetes-managed containers
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes /var/lib/etcd]
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
lucas@THEGOPHER:~/luxas/kubernetes$ sudo systemctl stop kubelet
lucas@THEGOPHER:~/luxas/kubernetes$ sudo ./kubeadm init --skip-preflight-checks
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[init] Using Kubernetes version: v1.7.4
[init] Using Authorization modes: [Node RBAC]
[preflight] Skipping pre-flight checks
[kubeadm] WARNING: starting in 1.8, tokens expire after 24 hours by default (if you require a non-expiring token use --token-ttl 0)
[certificates] Generated ca certificate and key.
[certificates] Generated apiserver certificate and key.
[certificates] apiserver serving cert is signed for DNS names [thegopher kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.1.115]
[certificates] Generated apiserver-kubelet-client certificate and key.
[certificates] Generated sa key and public key.
[certificates] Generated front-proxy-ca certificate and key.
[certificates] Generated front-proxy-client certificate and key.
[certificates] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "scheduler.conf"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
[init] Waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests"
[init] This often takes around a minute; or longer if the control plane images have to be pulled.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp 127.0.0.1:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp 127.0.0.1:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp 127.0.0.1:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz/syncloop' failed with error: Get http://localhost:10255/healthz/syncloop: dial tcp 127.0.0.1:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz/syncloop' failed with error: Get http://localhost:10255/healthz/syncloop: dial tcp 127.0.0.1:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz/syncloop' failed with error: Get http://localhost:10255/healthz/syncloop: dial tcp 127.0.0.1:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp 127.0.0.1:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz/syncloop' failed with error: Get http://localhost:10255/healthz/syncloop: dial tcp 127.0.0.1:10255: getsockopt: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10255/healthz' failed with error: Get http://localhost:10255/healthz: dial tcp 127.0.0.1:10255: getsockopt: connection refused.

Unfortunately, an error has occurred:
	timed out waiting for the condition

This error is likely caused by that:
	- The kubelet is not running
	- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
	- There is no internet connection; so the kubelet can't pull the following control plane images:
		- gcr.io/google_containers/kube-apiserver-amd64:v1.7.4
		- gcr.io/google_containers/kube-controller-manager-amd64:v1.7.4
		- gcr.io/google_containers/kube-scheduler-amd64:v1.7.4

You can troubleshoot this for example with the following commands if you're on a systemd-powered system:
	- 'systemctl status kubelet'
	- 'journalctl -xeu kubelet'
couldn't initialize a Kubernetes cluster

In this demo, I'm first starting kubeadm normally and everything works as usual.
In the second case, I'm explicitly stopping the kubelet so it doesn't run, and skipping preflight checks so that kubeadm doesn't even try to exec systemctl start kubelet as it usually does.
That obviously results in a non-working system, but now kubeadm tells the user what the problem is instead of waiting forever.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Fixes: kubernetes/kubeadm#377

Special notes for your reviewer:

Release note:

kubeadm: Detect kubelet readiness and error out if the kubelet is unhealthy

@kubernetes/sig-cluster-lifecycle-pr-reviews @pipejakob

cc @justinsb @kris-nova @lukemarsden as well, since you wanted this feature :)

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 25, 2017
@luxas luxas added this to the v1.8 milestone Aug 25, 2017
@k8s-github-robot k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Aug 25, 2017
@luxas (Member, Author) commented Aug 27, 2017

/retest

@luxas (Member, Author) commented Aug 27, 2017

/retest

@mattmoyer (Contributor) left a comment:

I'm really happy to have better diagnostic output here. It should be a big help to admins when things go wrong.

I'm slightly concerned that 155 seconds will not be enough. It seems hard to pick a safe timeout for this because it needs to cover a relatively large download over the user's internet connection which might be almost arbitrarily slow. Maybe we could keep trying forever, but start printing helpful warning/error messages after a certain amount of time? We might also just print out a message saying that kubeadm init has given up, but the kubelet might still get things up and running if you are able to restart or otherwise fix it.

My other style concern is that the number of timeout/retry constants here is kind of complex because of the way TryRunCommand and waitForAPIAndKubelet are nested. Perhaps there's a way we could refactor so there's a single select that blocks until the first failure condition is met? This is not a blocker.

The failing test looks like #51429.

Unfortunately, an error has occurred:
{{ .Error }}

This error is likely caused by that:

Contributor:

Nit: maybe "This error is likely caused by one of these problems:"?


go func(errC chan error, waiter apiclient.Waiter) {
// This goroutine can only make kubeadm init fail. If this check succeeds, it won't do anything special
if err := waiter.WaitForHealthyKubelet(40*time.Second, "http://localhost:10255/healthz"); err != nil {

Contributor:

Is it okay to assume the port number here? Isn't it configurable?

Member Author:

I think it's ok to assume that; it's the standard for the kubelet read-only port. If you change that manually, you will have way larger problems :)

@luxas (Member, Author) commented Sep 1, 2017

I'm slightly concerned that 155 seconds will not be enough. It seems hard to pick a safe timeout for this because it needs to cover a relatively large download over the user's internet connection which might be almost arbitrarily slow

That timeout is currently 30 minutes; we can increase it if you think that'd be useful.

My other style concern is that the number of timeout/retry constants here is kind of complex because of the way TryRunCommand and waitForAPIAndKubelet are nested. Perhaps there's a way we could refactor so there's a single select that blocks until the first failure condition is met? This is not a blocker.

They aren't nested. TryRunCommand is used for checking the kubelet's health, and the polling in WaitForAPI waits for the API server to become available in a second goroutine.

Maybe we could keep trying forever

That is what we explicitly want to avoid in prod scenarios.

Note that there are three different goroutines:

  • WaitForAPI waits for the API server to become running.
    • This requires the kubelet to come up healthy, the images to be pullable and the API server to start correctly
    • Times out after 30 minutes
  • TryRunCommand that polls /healthz -- which is kubelet liveness
    • Starts 40 seconds after WaitForAPI (as the kubelet should have come up successfully in >90% of cases by then; we don't want to start too early)
    • Times out after 155 seconds
  • TryRunCommand that polls /healthz/syncloop -- which is kubelet readiness AFAIU
    • Starts 60 seconds after WaitForAPI (as the kubelet should have come up successfully in >90% of cases by then; we don't want to start too early)
    • Times out after 155 seconds

And there is a single wait block; the first goroutine to return becomes the return value used by the function (a rough sketch of this pattern follows below). This means that if the /healthz check still isn't passing after 195 seconds, an error is returned and kubeadm init fails.
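
A rough sketch of that "first result wins" wait block, using a shared error channel. The function and parameter names here are placeholders, not the actual kubeadm code:

package main

import (
    "errors"
    "fmt"
    "time"
)

// waitForAPIAndKubeletSketch runs the API-server wait and the two kubelet
// health probes concurrently; whichever goroutine finishes first decides the
// outcome. Names and durations are placeholders for illustration.
func waitForAPIAndKubeletSketch(waitForAPI, checkLiveness, checkReadiness func() error) error {
    errC := make(chan error, 3) // buffered so the losing goroutines don't block

    go func() { errC <- waitForAPI() }()     // times out after 30 minutes in kubeadm
    go func() { errC <- checkLiveness() }()  // starts after 40s, gives up after 155s
    go func() { errC <- checkReadiness() }() // starts after 60s, gives up after 155s

    // Single wait block: the first value received (nil or an error) is what
    // kubeadm init acts on, so a failed /healthz probe fails init even while
    // the longer API wait is still in flight.
    return <-errC
}

func main() {
    err := waitForAPIAndKubeletSketch(
        func() error { time.Sleep(2 * time.Second); return nil },
        func() error { time.Sleep(1 * time.Second); return errors.New("kubelet /healthz never became healthy") },
        func() error { time.Sleep(3 * time.Second); return nil },
    )
    fmt.Println("result:", err) // the liveness failure wins the race in this example
}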

@luxas (Member, Author) commented Sep 1, 2017

/retest

1 similar comment
@luxas (Member, Author) commented Sep 1, 2017

/retest

@mattmoyer (Contributor):

/lgtm

We can always make the timeout configurable later if it is an issue in real environments. This change should be a strict improvement over the current behavior, so I don't want to block merging over that.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 1, 2017
@k8s-github-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: luxas, mattmoyer

Associated issue: 377

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@luxas (Member, Author) commented Sep 1, 2017

We can always make the timeout configurable later if it is an issue in real environments

Note that this 30min timeout already exists at HEAD; it isn't added or changed in this PR

/retest

@fejta-bot:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

2 similar comments
@fejta-bot:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@fejta-bot:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@luxas (Member, Author) commented Sep 2, 2017

/retest

@fejta-bot:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

1 similar comment
@fejta-bot:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 3, 2017
@luxas (Member, Author) commented Sep 3, 2017

Rebased, re-applying LGTM

@luxas luxas added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 3, 2017
@luxas (Member, Author) commented Sep 3, 2017

/retest

@fejta-bot:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot (Contributor):

@luxas: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-e2e-kops-aws
Commit: 92c5997 (link)
Rerun command: /test pull-kubernetes-e2e-kops-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot:

Automatic merge from submit-queue (batch tested with PRs 51682, 51546, 51369, 50924, 51827)

fmt.Printf("[init] Waiting for the kubelet to boot up the control plane as Static Pods from directory %q\n", kubeadmconstants.GetStaticPodDirectory())
fmt.Println("[init] This often takes around a minute; or longer if the control plane images have to be pulled.")

go func(errC chan error, waiter apiclient.Waiter) {

Contributor:

Shouldn't this be based on --skip-preflight-checks? If this flag is set, we shouldn't check the status of the kubelet; if not, it makes sense to check the status.

Member Author:

It's a hard line to balance... I'd prefer to have this as-is, although there is a small chance it will time out before all the images have been pulled.

Member Author:

If you want to discuss this, please open a new issue in kubernetes/kubeadm though
