Nodes lose Linode metadata after a reboot #36

Closed
asauber opened this issue Aug 2, 2019 · 4 comments

asauber commented Aug 2, 2019

Moving this Issue from the CSI repo.

linode/linode-blockstorage-csi-driver#34

Report from @wideareashb

After an OOM problem on one of our Linodes, CSI has stopped working and the Linode is not properly part of the cluster anymore.

The CoreOS logs for the node with the problem have a few of these:

systemd-networkd[604]: eth0: Could not set NDisc route or address: Connection timed out

then there is an OOM

and then more messages like the above. There are also a number of problems which seem to be caused by the OOM event.

Since then we have seen a number of problems:

  • the node did not appear in 'kubectl get nodes'
  • after a reboot the node is no longer properly recognised by Kubernetes:

e.g.
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node-2 Ready 157d v1.13.0 192.168.145.28 213.xxx.xxx.xxx Container Linux by CoreOS 2135.5.0 (Rhyolite) 4.19.50-coreos-r1 docker://18.6.3
node-3 Ready 5h41m v1.13.0 Container Linux by CoreOS 2135.5.0 (Rhyolite) 4.19.50-coreos-r1 docker://18.6.3

Note:

  • the age should be 157d, not 5h41m

  • it has no internal or external IP address

  • If you describe the node:

  • the node's annotations like 'providerID' were missing (we have tried adding them back in);

  • the node was descheduled

  • the node had no internal or external IP address

  • after fiddling with annotations, the node did get pods scheduled, but the Linode CSI driver is upset:

Aug 01 09:12:03 node-3 kubelet[687]: , failed to "StartContainer" for "csi-linode-plugin" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=csi-linode-plugin pod=csi
Aug 01 09:12:03 node-3 kubelet[687]: ]
Aug 01 09:12:18 node-3 kubelet[687]: E0801 09:12:18.716293 687 pod_workers.go:190] Error syncing pod c55016b7-b439-11e9-a66e-f23c914badbb ("csi-linode-node-4rqz6_kube-system(c55>
Aug 01 09:12:18 node-3 kubelet[687]: , failed to "StartContainer" for "csi-linode-plugin" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=csi-linode-plugin pod=csi>
Aug 01 09:12:18 node-3 kubelet[687]: ]

  • in the main CSI csi-linode-controller-0 pod, the csi-linode-plugin container logs are showing:

BODY:
{
    "errors": [
        {
            "reason": "Invalid OAuth Token"
        }
    ]
}
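
A quick way to tell whether that error comes from an expired or revoked token is to pull the token out of the cluster and exercise it against the Linode API directly. A minimal sketch, assuming the CSI driver reads its token from the token key of the kube-system/linode secret (check your own secret's key name first):

# extract the token the driver is using (the "token" key name is an assumption)
TOKEN=$(kubectl -n kube-system get secret linode -o jsonpath='{.data.token}' | base64 -d)

# a valid token returns your profile; an expired or revoked one returns the
# same {"errors": [{"reason": "Invalid OAuth Token"}]} body as above
curl -s -H "Authorization: Bearer $TOKEN" https://api.linode.com/v4/profile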

I think this has little to do with CSI -- after I restarted the Linode, it lost the traits it needs to work. For example, if I edit this node I get something like the below:

metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 192.168.xxx.yyy/17
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-08-01T15:28:39Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/hostname: node-3
  name: widearea-live-node-3
  resourceVersion: "30261967"
  selfLink: /api/v1/nodes/node-3
  uid: 097dd8b3-xxxxxx
spec:
  podCIDR: 10.244.5.0/24
  taints:
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"

but for a working one I get:

metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"linodebs.csi.linode.com":"1234567"}'
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 192.168.xxx.yyyy/17
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-02-24T19:26:35Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: g6-standard-4
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eu-west
    kubernetes.io/hostname: node-2
    topology.linode.com/region: eu-west
  name: node-2
  resourceVersion: "30262732"
  selfLink: /api/v1/nodes/node-2
  uid: 19841cc5-xxxxxxxx
spec:
  podCIDR: 10.244.1.0/24
  providerID: linode://1234567
status:
  addresses:
  - address: live-node-2
    type: Hostname
  - address: 213.x.y.z
    type: ExternalIP
  - address: 192.168.aaa.bbbb
    type: InternalIP
  allocatable:
    attachable-volumes-csi-linodebs.csi.linode.com: "8"

Our problem node has lost all its Linode traits.

I am looking into this and attempting to reproduce.
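
For anyone else checking a node, here is a minimal sketch of the traits the cloud provider integration is responsible for (the node name and Linode ID are the placeholders used above; linode-cli is just one way to trigger the reboot):

# provider ID, region labels, and addresses set by the cloud provider
kubectl get node node-3 -o jsonpath='{.spec.providerID}{"\n"}'
kubectl get node node-3 -L topology.linode.com/region -L failure-domain.beta.kubernetes.io/region
kubectl get node node-3 -o jsonpath='{range .status.addresses[*]}{.type}={.address}{"\n"}{end}'

# a node still waiting on the cloud provider keeps the "uninitialized" taint
kubectl describe node node-3 | grep -A2 Taints

# to reproduce: reboot the underlying Linode (placeholder ID), then re-run the checks above
linode-cli linodes reboot 1234567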

asauber changed the title from "Nodes list Linode metadata after a reboot" to "Nodes lose Linode metadata after a reboot" on Aug 2, 2019

asauber commented Aug 2, 2019

At this point it looks like this issue was mostly caused by an expired Linode API token.

For reference, you can refresh this token by visiting https://cloud.linode.com/profile/tokens, generating a new token, and editing the secret in your cluster.

# make sure to echo without a newline, -n
echo -n <secret> | base64

# copy output from above
kubectl edit secret -n kube-system linode

# edit and save the token in this secret
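
If you'd rather do this non-interactively, something like the following should work; the "token" data key and the csi-linode-node label are assumptions based on the driver's default manifests, so verify them against your own secret and pods first:

# patch the secret in place (assumes the data key is "token")
kubectl -n kube-system patch secret linode --type merge \
  -p "{\"data\":{\"token\":\"$(echo -n <new-token> | base64)\"}}"

# if the token is injected as an environment variable, the plugin pods only
# pick it up on startup, so recreate them (names/labels may differ in your cluster)
kubectl -n kube-system delete pod csi-linode-controller-0
kubectl -n kube-system delete pod -l app=csi-linode-node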


asauber commented Aug 8, 2019

The reporter of this issue also experienced linode/linode-blockstorage-csi-driver#32

This issue was resolved by issuing new Linode API tokens to this cluster.

@asauber asauber closed this as completed Aug 8, 2019
displague commented:

Was a 403 or similar access warning emitted to the logs that could have helped identify the problem?


asauber commented Aug 15, 2019

Yes, the CCM logs showed a 403
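
For future reference, something along these lines surfaces that quickly; the label selector is an assumption based on the CCM's default manifests, so adjust it to however your CCM is deployed:

# scan recent CCM logs for authorization failures against the Linode API
kubectl -n kube-system logs -l app=ccm-linode --tail=200 | grep -Ei '403|unauthorized|invalid oauth'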
