Node that is restarted never reconnects to cluster #45753

Closed
sjezewski opened this Issue May 12, 2017 · 6 comments

@sjezewski

sjezewski commented May 12, 2017

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): restart node


Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release):
$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian
HOME_URL="http://www.debian.org/"
SUPPORT_URL="http://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a): Linux ip-172-20-34-246 4.4.41-k8s #1 SMP Mon Jan 9 15:34:39 UTC 2017 x86_64 GNU/Linux
  • Install tools: kops Version 1.6.0-beta.1 (git-77f222d)
  • Others:

What happened:

As mentioned here, I need to restart a node as part of the GPU nvidia driver installation process.

However, after a restart (either via /sbin/shutdown -r or via the AWS UI), the node never seems to come back into the k8s cluster (it never shows up in the output of kubectl get nodes) unless I kill the API server pod, e.g.:

$ kubectl --namespace=kube-system delete po/kube-apiserver-ip-1-2-3-4.us-west-2.compute.internal

It takes ~2-3 min for the node to show up again, but it does then appear in the output of kubectl get nodes.

I don't think it's just a matter of waiting. I've waited an hour after a restart and the node never reappeared. It seems I must kill the api-server pod for the node to be detected again.
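
For anyone trying to reproduce the timing, here is a minimal sketch of how the "wait and see whether the node comes back" check could be scripted. The function names, the timeout, and the polling interval are illustrative assumptions, not part of the original report; it relies only on kubectl get nodes --no-headers and treats a STATUS column that starts with "Ready" as success.

```shell
# Sketch: poll until a node reports Ready, or give up after a timeout.
# Function names and default values are hypothetical, chosen for illustration.

node_is_ready() {
  # Succeeds when the named node's STATUS column starts with "Ready"
  # (this also accepts "Ready,SchedulingDisabled", but not "NotReady").
  kubectl get nodes --no-headers 2>/dev/null |
    awk -v n="$1" '$1 == n && $2 ~ /^Ready/ { found = 1 } END { exit !found }'
}

wait_for_node() {
  # Usage: wait_for_node NODE [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
  node=$1; timeout=${2:-600}; interval=${3:-10}; waited=0
  while ! node_is_ready "$node"; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

Something like wait_for_node ip-1-2-3-4.us-west-2.compute.internal 3600 30 would cover the one-hour wait described above; a non-zero exit status means the node never reported Ready within that window.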

What you expected to happen:

After a node restart, the node would appear ready and part of the k8s cluster according to kubectl get nodes

How to reproduce it (as minimally and precisely as possible):

I believe it's just a matter of restarting any VM. I've only tested on AWS, though.

Anything else we need to know:

@huangjiasingle

huangjiasingle commented May 16, 2017

@sjezewski please use the same version for the client and the server.

@kargakis

Member

kargakis commented May 21, 2017

@resouer

Member

resouer commented May 24, 2017

Can you make sure kubelet is auto started after your VM is rebooted?
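
One way to check this on the rebooted node, as a rough sketch: it assumes the node uses systemd and that the unit is named kubelet, neither of which is confirmed in this thread, and the helper name is made up for illustration.

```shell
# Sketch: report whether a systemd unit is enabled at boot and currently active.
# The default unit name "kubelet" is an assumption; check what your installer created.
unit_boot_status() {
  unit=${1:-kubelet}
  # is-enabled / is-active print the state and exit non-zero when it is not
  # "enabled" / "active"; fall back to "unknown" only if nothing was printed.
  enabled=$(systemctl is-enabled "$unit" 2>/dev/null) || enabled=${enabled:-unknown}
  active=$(systemctl is-active "$unit" 2>/dev/null) || active=${active:-unknown}
  echo "enabled=$enabled active=$active"
}
```

Running unit_boot_status kubelet after the reboot prints both states on one line. If the unit turns out not to be enabled, sudo systemctl enable kubelet would make it start at boot, and journalctl -u kubelet -b shows its logs since the last boot.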

@fejta-bot

fejta-bot commented Dec 25, 2017

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@fejta-bot

fejta-bot commented Jan 24, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@fejta-bot

fejta-bot commented Feb 23, 2018

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
