Unable to add node to cluster after cluster-reset #6186

Closed
zbup opened this issue Sep 28, 2022 · 14 comments

zbup commented Sep 28, 2022

Environmental Info:
K3s Version:
k3s version v1.24.5-rc1+k3s1 (fb823c8)
go version go1.18.6

Node(s) CPU architecture, OS, and Version:
Ubuntu 18.04
Linux firstnode 4.15.0-193-generic #204-Ubuntu SMP Fri Aug 26 19:20:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
For now, just 2 servers (which I realize isn't ideal with etcd but I think it might be unrelated)

Describe the bug:
Steps To Reproduce:

  1. Bring up a new cluster
  2. Add an additional node (everything seems fine)
  3. Shut down nodes
  4. Run --cluster-reset on second node
  5. Bring cluster back up on second node (everything seems fine)
  6. Use kubectl to delete the failed node (which sits in NotReady state)
  7. Run k3s-uninstall.sh on first node to start clean
  8. Run k3s server --server <second node>
    The first node never rejoins the cluster; it just sits in NotReady. (I am trialing k3s, simulating a server failure, and trying to rebuild the cluster.) A condensed command sketch of these steps follows below.
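
Roughly, the commands behind those steps look like this (just a sketch, assuming hostnames firstnode/secondnode and a shared token in /etc/rancher/k3s/config.yaml; the exact commands I used are in the follow-up comment below):

# 1-2: bring up the cluster and join a second server
curl -sfL https://get.k3s.io | sh -s - server --cluster-init --node-name firstnode                     # on firstnode
curl -sfL https://get.k3s.io | sh -s - server --server https://firstnode:6443 --node-name secondnode   # on secondnode

# 3-5: stop both nodes, reset on the second node, bring it back up
k3s-killall.sh                                            # on both nodes
sudo k3s server --cluster-reset --node-name secondnode    # on secondnode
sudo systemctl start k3s                                  # on secondnode

# 6-8: delete the stale node object, wipe the first node, try to rejoin it
kubectl delete node firstnode                             # on secondnode
k3s-uninstall.sh                                          # on firstnode
curl -sfL https://get.k3s.io | sh -s - server --server https://secondnode:6443 --node-name firstnode   # on firstnode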

The logs just keep saying over and over:

Sep 28 19:06:09 cert4 k3s[19809]: I0928 19:06:09.946192   19809 node_controller.go:406] Initializing node firstnode with cloud provider
Sep 28 19:06:09 cert4 k3s[19809]: E0928 19:06:09.946299   19809 node_controller.go:220] error syncing 'firstnode': failed to get provider ID for node firstnode at cloudprovider: failed to get instance ID from cloud provider: address annotations not yet set, requeuing
Sep 28 19:06:13 cert4 k3s[19809]: E0928 19:06:13.854772   19809 node_lifecycle_controller.go:149] error checking if node firstnode exists: address annotations not yet set

And eventually this error also starts to appear with the previous ones:

Sep 28 19:06:11 cert4 k3s[19809]: E0928 19:06:11.508514   19809 server.go:274] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"

A couple more that showed up:

Sep 28 19:20:24 cert4 k3s[19809]: time="2022-09-28T19:20:24Z" level=info msg="Couldn't find node internal ip annotation or label on node cert2.ash.lxdx"
Sep 28 19:20:24 cert4 k3s[19809]: time="2022-09-28T19:20:24Z" level=info msg="Couldn't find node hostname annotation or label on node cert2.ash.lxdx"

Expected behavior:
I should be able to add a node

Actual behavior:
Errors printed above

zbup commented Sep 28, 2022

For the sake of completeness.... I create a config file on all the nodes: /etc/rancher/k3s/config.yaml

token: blahblahblah
cluster-cidr: 10.250.192.0/19
service-cidr: 10.250.224.0/19
cluster-dns: 10.250.224.10

To install first node:
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.24.5-rc1+k3s1 sh -s - server --cluster-init --node-name firstnode
To install second node:
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.24.5-rc1+k3s1 sh -s - server --server https://firstnode:6443 --node-name secondnode

All good at this point. I shut down both nodes with k3s-killall.sh.

On the second node I do a cluster reset:
sudo k3s server --cluster-reset --node-name secondnode
then bring the cluster back up on the second node with systemctl start k3s.

On the first server I run k3s-uninstall.sh.

I wait for a minute until the cluster looks happy on the second node... Pods are restarted. Everything looks happy.

I remove the first node's stale entry with kubectl delete node firstnode

Then I try to bootstrap the first node back into the cluster (making sure the config file is recreated first) with:
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.24.5-rc1+k3s1 sh -s - server --server https://secondnode:6443 --node-name firstnode

Hostnames are redacted; DNS resolution for the hostnames works.

zbup commented Sep 28, 2022

Output of kubectl get node firstnode -o yaml:

apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2022-09-28T19:05:35Z"
  finalizers:
  - wrangler.cattle.io/node
  - wrangler.cattle.io/managed-etcd-controller
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: firstnode
    kubernetes.io/os: linux
  name: firstnode
  resourceVersion: "3791"
  uid: c3da2936-3547-4e79-8815-42e0319fc66c
spec:
  podCIDR: 10.250.192.0/24
  podCIDRs:
  - 10.250.192.0/24
  taints:
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
    timeAdded: "2022-09-28T19:06:18Z"
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    timeAdded: "2022-09-28T19:06:23Z"
status:
  addresses:
  - address: 10.254.9.15
    type: InternalIP
  - address: cert2.ash.lxdx
    type: Hostname
  allocatable:
    cpu: "48"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 65845228Ki
    pods: "110"
  capacity:
    cpu: "48"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 65845228Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2022-09-28T19:05:35Z"
    lastTransitionTime: "2022-09-28T19:06:18Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: MemoryPressure
  - lastHeartbeatTime: "2022-09-28T19:05:35Z"
    lastTransitionTime: "2022-09-28T19:06:18Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: DiskPressure
  - lastHeartbeatTime: "2022-09-28T19:05:35Z"
    lastTransitionTime: "2022-09-28T19:06:18Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: PIDPressure
  - lastHeartbeatTime: "2022-09-28T19:05:35Z"
    lastTransitionTime: "2022-09-28T19:06:18Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  nodeInfo:
    architecture: amd64
    bootID: ba18a5a5-4e08-4ac9-958b-d8899f77b8be
    containerRuntimeVersion: containerd://1.6.8-k3s1
    kernelVersion: 4.15.0-193-generic
    kubeProxyVersion: v1.24.5-rc1+k3s1
    kubeletVersion: v1.24.5-rc1+k3s1
    machineID: b365db165d3f4faeb31a7a1d8c798d0c
    operatingSystem: linux
    osImage: Ubuntu 18.04.6 LTS
    systemUUID: 00000000-0000-0000-0000-0CC47A394574

zbup commented Sep 28, 2022

Okay, I worked around the issue, but I'm wondering if it's still something that should be looked at.

Even though I deleted the node using kubectl, clearly there is still some history of the node in the cluster somewhere.

I was able to add the first node back into the cluster by manually setting a different --node-name.
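
Concretely, the workaround was just to re-run the install with a node name the cluster had never seen before (the new name below is only illustrative):

# Re-join with a brand-new node name instead of reusing "firstnode"
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.24.5-rc1+k3s1 sh -s - server \
  --server https://secondnode:6443 --node-name firstnode-new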

brandond commented Sep 28, 2022

When you delete the first node from the cluster, does it actually finish deleting - are both the node and the node password secret gone from the cluster?
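
Something like the following should come back NotFound for both once the delete has fully gone through (K3s keeps the node password in a kube-system secret named <nodename>.node-password.k3s):

# Both should return NotFound after the node is fully deleted
kubectl get node firstnode
kubectl get secret firstnode.node-password.k3s -n kube-system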

zbup commented Sep 28, 2022

I am not totally sure how to tell if it fully finished deleting... But the node and the secret are indeed no longer there.

The only reference I see is:

Sep 28 20:06:58 cert4 k3s[29675]: time="2022-09-28T20:06:58Z" level=info msg="Removed coredns node hosts entry [10.254.9.15 firstnode]"

zbup commented Sep 28, 2022

There must be some sort of delay in clearing things out when you delete the node, but only sometimes: I've waited 5-10 minutes and it still had issues with the same name.
I just tried again and it worked pretty quickly, and I was able to re-add the node with the same name. I guess I should wait for this log message before attempting to add the node again:

Sep 28 20:16:58 cert2 k3s[32647]: I0928 20:16:58.677168   32647 event.go:294] "Event occurred" object="firstnode" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="RemovingNode" message="Node firstnode event: Removing Node firstnode from Controller"
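
Rather than tailing the journal, one way to watch for that removal is to filter events by reason (a sketch):

# Watch for the RemovingNode event instead of grepping the k3s journal
kubectl get events -A --field-selector reason=RemovingNode --watch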

zbup commented Sep 28, 2022

I think I'm good here, I will close for now unless you want me to do any more digging on it.

zbup closed this as completed Sep 28, 2022
zbup reopened this Sep 29, 2022

zbup commented Sep 29, 2022

I tried from scratch again and I'm still having the same issue; I don't know if it's a timing thing.
Brought up a 2-node cluster, shut it down, did a --cluster-reset, brought it back up, and removed the failed node. Uninstalled and re-installed k3s on the failed node (also removed /etc/rancher/node/password, which isn't deleted by k3s-uninstall.sh), then tried to add the node back in, and it's stuck in NotReady.
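
For completeness, the clean-up I do on the failed node before trying to rejoin looks roughly like this (a sketch; the rm is for the file the uninstall script leaves behind):

# On the node being wiped and re-added
k3s-uninstall.sh
sudo rm -f /etc/rancher/node/password      # not removed by k3s-uninstall.sh
# recreate /etc/rancher/k3s/config.yaml (token, CIDRs), then rejoin:
curl -sfL https://get.k3s.io | sh -s - server --server https://secondnode:6443 --node-name firstnode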

zbup commented Sep 29, 2022

Okay, this is interesting. After doing the cluster reset on one node, bringing it back up, and then trying to add the other node back in (after it had been completely uninstalled and re-installed), the node that had the reset done shows this in kubectl get nodes:

NAME              STATUS     ROLES                       AGE   VERSION
primary-node-that-had-reset    Ready      control-plane,etcd,master   17m   v1.24.5-rc1+k3s1
node-I-tried-to-re-add with new node-name   NotReady   <none>                      7s    v1.24.5-rc1+k3s1

And then on the node I added back (totally uninstalled and re-installed), kubectl get nodes shows:

root@cert4:/var/lib/rancher# kubectl get nodes
NAME              STATUS     ROLES                       AGE     VERSION
primary-node-that-had-reset    NotReady   control-plane,etcd,master   21m     v1.24.5-rc1+k3s1
old-node-name   NotReady   control-plane,etcd,master   19m     v1.24.5-rc1+k3s1
node-I-tried-to-re-add with new node-name   Ready      control-plane,etcd,master   3m28s   v1.24.5-rc1+k3s1

Is etcd getting confused? Shouldn't etcd have a single member after --cluster-reset? It's like the new node I brought in gets started from an old copy of the etcd database and then just adds to it, and the two copies are out of sync somehow. The node had already been deleted with kubectl before I started adding it back again, so the cluster shouldn't know about old-node-name anymore.
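
One way to see whether etcd itself still thinks there are extra members is to point etcdctl at the embedded etcd on the reset node (a sketch, assuming the default k3s data dir and an etcdctl binary installed separately; after a --cluster-reset there should be exactly one member):

# List embedded etcd members using the default k3s certificate paths
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list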

zbup commented Sep 30, 2022

Tried with v1.22.15+k3s1 and v1.24.6+k3s1 and they both seem to do the same thing. It's like it performs the initial sync of the new node with an old copy of the etcd database and they never come back into sync.

@brandond

This sounds very similar to etcd-io/etcd#14009 - unfortunately I haven't been able to reproduce it without involving Kubernetes, so upstream has had a hard time addressing it.

zbup commented Oct 3, 2022

Man, I can reproduce it nearly every time. I've done it no less than 30 times and it happens in 28 of them. I will try restoring a snapshot. I guess it doesn't help much if it can't be reproduced with etcd by itself.

zbup closed this as completed Oct 3, 2022
zbup reopened this Oct 3, 2022

zbup commented Oct 4, 2022

Okay, I verified that I can cluster-reset, start, take a snapshot, cluster-reset again from that snapshot, and get my cluster going again.
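
For anyone hitting the same thing, the sequence that works for me looks roughly like this (a sketch; the snapshot filename gets the node name and a timestamp appended, so adjust the restore path to whatever k3s etcd-snapshot save reports):

# On the surviving server, mirroring the sequence above
sudo k3s server --cluster-reset                  # initial reset to a single-member etcd
sudo systemctl start k3s                         # bring the single-server cluster up
sudo k3s etcd-snapshot save --name rejoin        # take a fresh snapshot
sudo systemctl stop k3s
sudo k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/rejoin-<node>-<timestamp>
sudo systemctl start k3s                         # then re-add the other node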

stale bot commented Apr 2, 2023

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

stale bot added the status/stale label Apr 2, 2023
stale bot closed this as completed Apr 17, 2023