CSI stopped working #34

Closed
wideareashb opened this issue Aug 1, 2019 · 2 comments


@wideareashb

After an OOM problem on one of our Linodes, CSI has stopped working and the Linode is not properly part of the cluster anymore.

The CoreOS logs for the node with the problem have a few of these:

systemd-networkd[604]: eth0: Could not set NDisc route or address: Connection timed out

then there is an OOM, and then more messages like the above.

Since then we have seen a number of problems, which seem to have been caused by the OOM event:

  • the node did not appear in 'kubectl get nodes'

  • after a reboot the node is no longer properly recognised by Kubernetes, e.g. in 'kubectl get nodes -o wide':
NAME     STATUS   ROLES    AGE     VERSION   INTERNAL-IP      EXTERNAL-IP       OS-IMAGE                                        KERNEL-VERSION      CONTAINER-RUNTIME
node-2   Ready    <none>   157d    v1.13.0   192.168.145.28   213.xxx.xxx.xxx   Container Linux by CoreOS 2135.5.0 (Rhyolite)   4.19.50-coreos-r1   docker://18.6.3
node-3   Ready    <none>   5h41m   v1.13.0   <none>           <none>            Container Linux by CoreOS 2135.5.0 (Rhyolite)   4.19.50-coreos-r1   docker://18.6.3

Note:

  • the age should be 157d
  • it has no internal or external IP address
  • if you describe the node:
      • the node's attributes like 'providerID' were missing (we have tried adding them back in)
      • the node was descheduled
      • the node had no internal or external IP address
  • after fiddling with annotations, the node did get pods scheduled, but CSI Linode is upset:

Aug 01 09:12:03 node-3 kubelet[687]: , failed to "StartContainer" for "csi-linode-plugin" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=csi-linode-plugin pod=csi
Aug 01 09:12:03 node-3 kubelet[687]: ]
Aug 01 09:12:18 node-3 kubelet[687]: E0801 09:12:18.716293 687 pod_workers.go:190] Error syncing pod c55016b7-b439-11e9-a66e-f23c914badbb ("csi-linode-node-4rqz6_kube-system(c55>
Aug 01 09:12:18 node-3 kubelet[687]: , failed to "StartContainer" for "csi-linode-plugin" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=csi-linode-plugin pod=csi>
Aug 01 09:12:18 node-3 kubelet[687]: ]

  • in the logs of the linode-csi-plugin container of the main csi-linode-controller-0 pod we are seeing:

BODY:
{
  "errors": [
    {
      "reason": "Invalid OAuth Token"
    }
  ]
}
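A quick way to check whether the token the plugin is using is still valid against the Linode API (a sketch only; it assumes the token sits in the 'linode' secret in kube-system that the CSI manifests use by default, which may differ per setup):

# pull the token the driver is configured with out of the secret
TOKEN=$(kubectl -n kube-system get secret linode -o jsonpath='{.data.token}' | base64 --decode)
# any authenticated v4 endpoint will do; /profile returns the token's owner on success
curl -sS -H "Authorization: Bearer $TOKEN" https://api.linode.com/v4/profile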

@wideareashb (Author)

I think this has little to do with CSI -- after I restarted the Linode, the node has lost the traits it needs to work. For example, if I edit this node I get something like the below:

metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 192.168.xxx.yyy/17
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-08-01T15:28:39Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/hostname: node-3
  name: widearea-live-node-3
  resourceVersion: "30261967"
  selfLink: /api/v1/nodes/node-3
  uid: 097dd8b3-xxxxxx
spec:
  podCIDR: 10.244.5.0/24
  taints:
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"

but for a working one I get:

metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"linodebs.csi.linode.com":"1234567"}'
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 192.168.xxx.yyyy/17
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-02-24T19:26:35Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: g6-standard-4
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eu-west
    kubernetes.io/hostname: node-2
    topology.linode.com/region: eu-west
  name: node-2
  resourceVersion: "30262732"
  selfLink: /api/v1/nodes/node-2
  uid: 19841cc5-xxxxxxxx
spec:
  podCIDR: 10.244.1.0/24
  providerID: linode://1234567
status:
  addresses:
  - address: live-node-2
    type: Hostname
  - address: 213.x.y.z
    type: ExternalIP
  - address: 192.168.aaa.bbbb
    type: InternalIP
  allocatable:
    attachable-volumes-csi-linodebs.csi.linode.com: "8"

Our problem node has lost all its Linode traits.
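For reference, roughly what we tried when adding the traits back by hand (a sketch only; 'node-3' and the Linode ID 1234567 are the placeholders from the examples above, and normally the cloud-controller-manager sets all of this when it initializes the node):

# restore the providerID that the Linode integrations key off (use the real Linode ID)
kubectl patch node node-3 -p '{"spec":{"providerID":"linode://1234567"}}'
# drop the 'uninitialized' taint so pods can be scheduled again; the
# cloud-controller-manager would normally remove this itself
kubectl taint nodes node-3 node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-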

@asauber (Contributor)

asauber commented Aug 2, 2019

Closed in favor of linode/linode-cloud-controller-manager#36

@asauber closed this as completed Aug 2, 2019