CSI stopped working #34

Closed
wideareashb opened this issue Aug 1, 2019 · 2 comments


@wideareashb

After an OOM problem on one of our Linodes, CSI has stopped working and the Linode is not properly part of the cluster anymore.

The CoreOS logs for the node with the problem have a few of these:

systemd-networkd[604]: eth0: Could not set NDisc route or address: Connection timed out

then there is an OOM, and then more messages like the above.

Since then we have seen a number of problems, which seem to have been caused by the OOM event:

  • the node did not appear in 'kubectl get nodes'

  • after a reboot the node is no longer properly recognised by Kubernetes, e.g. in 'kubectl get nodes -o wide':
NAME     STATUS   ROLES    AGE     VERSION   INTERNAL-IP      EXTERNAL-IP       OS-IMAGE                                        KERNEL-VERSION      CONTAINER-RUNTIME
node-2   Ready    <none>   157d    v1.13.0   192.168.145.28   213.xxx.xxx.xxx   Container Linux by CoreOS 2135.5.0 (Rhyolite)   4.19.50-coreos-r1   docker://18.6.3
node-3   Ready    <none>   5h41m   v1.13.0   <none>           <none>            Container Linux by CoreOS 2135.5.0 (Rhyolite)   4.19.50-coreos-r1   docker://18.6.3

Note:

  • the age should be 157d
  • it has no internal or external IP address
  • if you describe the node:
      • the node's attributes like 'providerID' were missing (we have tried adding them back in)
      • the node was descheduled
      • the node had no internal or external IP address
  • after fiddling with annotations, the node did get pods scheduled, but CSI Linode is upset:

Aug 01 09:12:03 node-3 kubelet[687]: , failed to "StartContainer" for "csi-linode-plugin" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=csi-linode-plugin pod=csi
Aug 01 09:12:03 node-3 kubelet[687]: ]
Aug 01 09:12:18 node-3 kubelet[687]: E0801 09:12:18.716293 687 pod_workers.go:190] Error syncing pod c55016b7-b439-11e9-a66e-f23c914badbb ("csi-linode-node-4rqz6_kube-system(c55>
Aug 01 09:12:18 node-3 kubelet[687]: , failed to "StartContainer" for "csi-linode-plugin" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=csi-linode-plugin pod=csi>
Aug 01 09:12:18 node-3 kubelet[687]: ]

  • in the logs of the linode-csi-plugin container of the main csi-linode-controller-0 pod we are seeing:

BODY:
{
  "errors": [
    {
      "reason": "Invalid OAuth Token"
    }
  ]
}
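A quick way to check whether the token the plugin is using is still valid against the Linode API (a sketch only; it assumes the token sits in the 'linode' secret in kube-system that the CSI manifests use by default, which may differ per setup):

# pull the token the driver is configured with out of the secret
TOKEN=$(kubectl -n kube-system get secret linode -o jsonpath='{.data.token}' | base64 --decode)
# any authenticated v4 endpoint will do; /profile returns the token's owner on success
curl -sS -H "Authorization: Bearer $TOKEN" https://api.linode.com/v4/profile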

@wideareashb (Author)

I think this has little to do with CSI -- after I restarted the Linode, the node has lost the traits it needs to work. For example, if I edit this node I get something like the below:

metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 192.168.xxx.yyy/17
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-08-01T15:28:39Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/hostname: node-3
  name: widearea-live-node-3
  resourceVersion: "30261967"
  selfLink: /api/v1/nodes/node-3
  uid: 097dd8b3-xxxxxx
spec:
  podCIDR: 10.244.5.0/24
  taints:
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"

but for a working one I get:

metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"linodebs.csi.linode.com":"1234567"}'
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 192.168.xxx.yyyy/17
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-02-24T19:26:35Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: g6-standard-4
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eu-west
    kubernetes.io/hostname: node-2
    topology.linode.com/region: eu-west
  name: node-2
  resourceVersion: "30262732"
  selfLink: /api/v1/nodes/node-2
  uid: 19841cc5-xxxxxxxx
spec:
  podCIDR: 10.244.1.0/24
  providerID: linode://1234567
status:
  addresses:
  - address: live-node-2
    type: Hostname
  - address: 213.x.y.z
    type: ExternalIP
  - address: 192.168.aaa.bbbb
    type: InternalIP
  allocatable:
    attachable-volumes-csi-linodebs.csi.linode.com: "8"

Our problem node has lost all its Linode traits.
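For reference, roughly what we tried when adding the traits back by hand (a sketch only; 'node-3' and the Linode ID 1234567 are the placeholders from the examples above, and normally the cloud-controller-manager sets all of this when it initializes the node):

# restore the providerID that the Linode integrations key off (use the real Linode ID)
kubectl patch node node-3 -p '{"spec":{"providerID":"linode://1234567"}}'
# drop the 'uninitialized' taint so pods can be scheduled again; the
# cloud-controller-manager would normally remove this itself
kubectl taint nodes node-3 node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-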

@asauber (Contributor)

asauber commented Aug 2, 2019

Closed in favor of linode/linode-cloud-controller-manager#36

@asauber closed this as completed Aug 2, 2019