Nodes lose Linode metadata after a reboot #36

Closed
asauber opened this issue Aug 2, 2019 · 4 comments

asauber commented Aug 2, 2019

Moving this Issue from the CSI repo.

linode/linode-blockstorage-csi-driver#34

Report from @wideareashb

After an OOM problem on one of our Linodes, CSI has stopped working and the Linode is not properly part of the cluster anymore.

The CoreOS logs for the node with the problem have a few of these:

systemd-networkd[604]: eth0: Could not set NDisc route or address: Connection timed out

then there is an OOM

and then more messages like the above. There are also a number of problems which seem to be caused by the OOM event.

Since then we have seen a number of problems:

  • the node did not appear in 'kubectl get nodes'
  • after a reboot the node is no longer properly recognised by Kubernetes:

e.g.
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node-2 Ready 157d v1.13.0 192.168.145.28 213.xxx.xxx.xxx Container Linux by CoreOS 2135.5.0 (Rhyolite) 4.19.50-coreos-r1 docker://18.6.3
node-3 Ready 5h41m v1.13.0 Container Linux by CoreOS 2135.5.0 (Rhyolite) 4.19.50-coreos-r1 docker://18.6.3

Note:

  • the age should be 157d, not 5h41m

  • it has no internal or external IP address

  • If you describe the node:

  • the node's annotations like 'providerID' were missing (we have tried adding them back in);

  • the node was descheduled

  • the node had no internal or external IP address

  • after fiddling with annotations, the node did get pods scheduled, but the Linode CSI driver is upset:

Aug 01 09:12:03 node-3 kubelet[687]: , failed to "StartContainer" for "csi-linode-plugin" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=csi-linode-plugin pod=csi
Aug 01 09:12:03 node-3 kubelet[687]: ]
Aug 01 09:12:18 node-3 kubelet[687]: E0801 09:12:18.716293 687 pod_workers.go:190] Error syncing pod c55016b7-b439-11e9-a66e-f23c914badbb ("csi-linode-node-4rqz6_kube-system(c55>
Aug 01 09:12:18 node-3 kubelet[687]: , failed to "StartContainer" for "csi-linode-plugin" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=csi-linode-plugin pod=csi>
Aug 01 09:12:18 node-3 kubelet[687]: ]

  • in the main CSI csi-linode-controller-0 pod, the csi-linode-plugin container logs are showing:

BODY:
{
    "errors": [
        {
            "reason": "Invalid OAuth Token"
        }
    ]
}
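
A quick way to tell whether that error comes from an expired or revoked token is to pull the token out of the cluster and exercise it against the Linode API directly. A minimal sketch, assuming the CSI driver reads its token from the token key of the kube-system/linode secret (check your own secret's key name first):

# extract the token the driver is using (the "token" key name is an assumption)
TOKEN=$(kubectl -n kube-system get secret linode -o jsonpath='{.data.token}' | base64 -d)

# a valid token returns your profile; an expired or revoked one returns the
# same {"errors": [{"reason": "Invalid OAuth Token"}]} body as above
curl -s -H "Authorization: Bearer $TOKEN" https://api.linode.com/v4/profile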

I think this has little to do with CSI -- after I restarted the Linode, it lost the traits it needs to work. For example, if I edit this node I get something like the below:

metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 192.168.xxx.yyy/17
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-08-01T15:28:39Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/hostname: node-3
  name: widearea-live-node-3
  resourceVersion: "30261967"
  selfLink: /api/v1/nodes/node-3
  uid: 097dd8b3-xxxxxx
spec:
  podCIDR: 10.244.5.0/24
  taints:
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"

but for a working one I get:

metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"linodebs.csi.linode.com":"1234567"}'
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 192.168.xxx.yyyy/17
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-02-24T19:26:35Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: g6-standard-4
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eu-west
    kubernetes.io/hostname: node-2
    topology.linode.com/region: eu-west
  name: node-2
  resourceVersion: "30262732"
  selfLink: /api/v1/nodes/node-2
  uid: 19841cc5-xxxxxxxx
spec:
  podCIDR: 10.244.1.0/24
  providerID: linode://1234567
status:
  addresses:
  - address: live-node-2
    type: Hostname
  - address: 213.x.y.z
    type: ExternalIP
  - address: 192.168.aaa.bbbb
    type: InternalIP
  allocatable:
    attachable-volumes-csi-linodebs.csi.linode.com: "8"

Our problem node has lost all its Linode traits.

I am looking into this and attempting to reproduce.
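
For anyone else checking a node, here is a minimal sketch of the traits the cloud provider integration is responsible for (the node name and Linode ID are the placeholders used above; linode-cli is just one way to trigger the reboot):

# provider ID, region labels, and addresses set by the cloud provider
kubectl get node node-3 -o jsonpath='{.spec.providerID}{"\n"}'
kubectl get node node-3 -L topology.linode.com/region -L failure-domain.beta.kubernetes.io/region
kubectl get node node-3 -o jsonpath='{range .status.addresses[*]}{.type}={.address}{"\n"}{end}'

# a node still waiting on the cloud provider keeps the "uninitialized" taint
kubectl describe node node-3 | grep -A2 Taints

# to reproduce: reboot the underlying Linode (placeholder ID), then re-run the checks above
linode-cli linodes reboot 1234567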

asauber changed the title from "Nodes list Linode metadata after a reboot" to "Nodes lose Linode metadata after a reboot" on Aug 2, 2019

asauber commented Aug 2, 2019

At this point it looks like this issue was mostly caused by an expired Linode API token.

For reference, you can refresh this token by visiting https://cloud.linode.com/profile/tokens, generating a new token, and editing the secret in your cluster.

# make sure to echo without a newline, -n
echo -n <secret> | base64

# copy output from above
kubectl edit secret -n kube-system linode

# edit and save the token in this secret
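
If you'd rather do this non-interactively, something like the following should work; the "token" data key and the csi-linode-node label are assumptions based on the driver's default manifests, so verify them against your own secret and pods first:

# patch the secret in place (assumes the data key is "token")
kubectl -n kube-system patch secret linode --type merge \
  -p "{\"data\":{\"token\":\"$(echo -n <new-token> | base64)\"}}"

# if the token is injected as an environment variable, the plugin pods only
# pick it up on startup, so recreate them (names/labels may differ in your cluster)
kubectl -n kube-system delete pod csi-linode-controller-0
kubectl -n kube-system delete pod -l app=csi-linode-node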


asauber commented Aug 8, 2019

The reporter of this issue also experienced linode/linode-blockstorage-csi-driver#32

This issue was resolved by issuing new Linode API tokens to this cluster.

@asauber asauber closed this as completed Aug 8, 2019
displague commented:

Was a 403 or similar access warning emitted to the logs that could have helped identify the problem?


asauber commented Aug 15, 2019

Yes, the CCM logs showed a 403
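
For future reference, something along these lines surfaces that quickly; the label selector is an assumption based on the CCM's default manifests, so adjust it to however your CCM is deployed:

# scan recent CCM logs for authorization failures against the Linode API
kubectl -n kube-system logs -l app=ccm-linode --tail=200 | grep -Ei '403|unauthorized|invalid oauth'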
