
Pods mounting EFS-CSI-driver-based volumes stuck in ContainerCreating for a long time because EFS volumes fail to mount (kubelet error "Unable to attach or mount volumes" [...] "timed out waiting for the condition") #765

Closed
jgoeres opened this issue Sep 15, 2022 · 15 comments · May be fixed by #1074
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jgoeres

jgoeres commented Sep 15, 2022

Hi,
we are using the EFS CSI driver (currently version 1.3.2) to provision EFS-based volumes to our workloads.
One of our clusters is currently suffering from a situation where freshly deployed pods that mount such volumes are stuck in ContainerCreating (or "Init:0/" for pods with init containers) for a very long time. Pods that are part of the same workload but do not mount EFS volumes are not affected, so this is 99.9% certain to be related to the EFS CSI driver.

This is how the (somewhat anonymized) workload presents itself when it is in that stuck state:

$ kubectl get pods -n mynamespace -o wide
NAME                                      READY   STATUS              RESTARTS   AGE   IP             NODE                                         
abc-0                                     0/1     ContainerCreating   0          14m   <none>         ip-10-0-138-106.eu-central-1.compute.internal
abc-1                                     0/1     ContainerCreating   0          14m   <none>         ip-10-0-142-147.eu-central-1.compute.internal
foobar-0                                  0/1     ContainerCreating   0          14m   <none>         ip-10-0-138-106.eu-central-1.compute.internal
foobar-1                                  0/1     ContainerCreating   0          14m   <none>         ip-10-0-146-19.eu-central-1.compute.internal 
job-without-efs-mounts-43611-jzqth        1/1     Running             0          14m   10.0.145.110   ip-10-0-145-28.eu-central-1.compute.internal 
other-job-without-efs-mounts-43611-6qfbv  1/1     Running             0          14m   10.0.147.196   ip-10-0-145-28.eu-central-1.compute.internal 
lala-default-0                            0/1     Init:0/1            0          14m   <none>         ip-10-0-142-147.eu-central-1.compute.internal
meme-default-0                            0/1     Init:0/2            0          14m   <none>         ip-10-0-145-28.eu-central-1.compute.internal    
[...]
meme-default-2                            0/1     Init:0/2            0          14m   <none>         ip-10-0-143-249.eu-central-1.compute.internal    

As an example, these are the events for the pod meme-default-2 while the pod is in this state (note that the volume that does attach immediately without problems is an EBS volume, handled by the EBS-CSI driver):

Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Warning  FailedScheduling        24m                  default-scheduler        0/7 nodes are available: 1 node(s) had taint {foo: bar}, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 4 Insufficient cpu.
  Normal   TriggeredScaleUp        24m                  cluster-autoscaler       pod triggered scale-up: [{eks-xxx-default2-integral-thrush-7cc12cf7-ecbb-2a1c-fb85-efbf1546dc08 1->2 (max: 99)}]
  Warning  FailedScheduling        23m (x2 over 23m)    default-scheduler        0/8 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) had taint {foo: bar}, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 4 Insufficient cpu.
  Warning  FailedScheduling        22m (x2 over 23m)    default-scheduler        0/8 nodes are available: 1 node(s) had taint {foo: bar}, that the pod didn't tolerate, 3 node(s) had volume node affinity conflict, 4 Insufficient cpu.
  Normal   Scheduled               22m                  default-scheduler        Successfully assigned mynamespace/meme-default-3 to ip-10-0-143-249.eu-central-1.compute.internal
  Normal   SuccessfulAttachVolume  22m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-dcc8cc2b-8729-5378-080d-c1639e0e8a1e"
  Warning  FailedMount             16m                  kubelet                  Unable to attach or mount volumes: unmounted volumes=[some-efs-volume], unattached volumes=[tmp feature-toggle data-volume some-efs-volume kube-api-access-prc4n]: timed out waiting for the condition
  Warning  FailedMount             13m (x2 over 18m)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[some-efs-volume], unattached volumes=[feature-toggle data-volume some-efs-volume kube-api-access-prc4n tmp]: timed out waiting for the condition
  Warning  FailedMount             2m39s                kubelet                  Unable to attach or mount volumes: unmounted volumes=[some-efs-volume], unattached volumes=[some-efs-volume kube-api-access-prc4n tmp feature-toggle data-volume]: timed out waiting for the condition
  Warning  FailedAttachVolume      2m35s (x9 over 20m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-8ff25a34-6018-47a6-b11d-818cb39af55f" : Attach timeout for volume fs-0aaeaaaaaaaaaa::fsap-095aaaaaaaaaaaa
  Warning  FailedMount             25s (x6 over 20m)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[some-efs-volume], unattached volumes=[kube-api-access-prc4n tmp feature-toggle data-volume some-efs-volume]: timed out waiting for the condition

Note that in this example, the cluster autoscaler did perform a scale-up, but the issue also occurs on pods scheduled on already existing nodes. So I don't think that the autoscaler is involved in the problem.

The EFS CSI node pod on the node on which the above pod is scheduled logs no obvious errors (at least none obvious to someone not familiar with the inner workings of the EFS CSI driver):

$ kubectl logs -n management efs-csi-node-2kqrl efs-plugin
I0915 14:24:39.398396       1 config_dir.go:87] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0915 14:24:39.401588       1 mount_linux.go:173] Cannot run systemd-run, assuming non-systemd OS
I0915 14:24:39.401601       1 driver.go:140] Did not find any input tags.
I0915 14:24:39.401706       1 driver.go:113] Registering Node Server
I0915 14:24:39.401715       1 driver.go:115] Registering Controller Server
I0915 14:24:39.401722       1 driver.go:118] Starting watchdog
I0915 14:24:39.401762       1 efs_watch_dog.go:209] Copying /etc/amazon/efs/efs-utils.conf since it doesn't exist
I0915 14:24:39.401814       1 efs_watch_dog.go:209] Copying /etc/amazon/efs/efs-utils.crt since it doesn't exist
I0915 14:24:39.402607       1 driver.go:124] Staring subreaper
I0915 14:24:39.402622       1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}

$ kubectl logs -n management efs-csi-node-2kqrl csi-driver-registrar
I0915 14:24:45.939706       1 main.go:113] Version: v2.1.0-0-g80d42f24
I0915 14:24:45.940147       1 main.go:137] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0915 14:24:45.940161       1 connection.go:153] Connecting to unix:///csi/csi.sock
I0915 14:24:45.940473       1 main.go:144] Calling CSI driver to discover driver name
I0915 14:24:45.942572       1 main.go:154] CSI driver name: "efs.csi.aws.com"
I0915 14:24:45.942594       1 node_register.go:52] Starting Registration Server at: /registration/efs.csi.aws.com-reg.sock
I0915 14:24:45.942679       1 node_register.go:61] Registration Server started at: /registration/efs.csi.aws.com-reg.sock
I0915 14:24:45.942721       1 node_register.go:83] Skipping healthz server because HTTP endpoint is set to: ""
I0915 14:24:46.371532       1 main.go:80] Received GetInfo call: &InfoRequest{}
I0915 14:24:46.402238       1 main.go:90] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

Eventually, the attaching/mounting of the EFS volumes will succeed; this can take 10-15 minutes, but sometimes hours.
Usually, once mounting works again, it works for all pods that are currently stuck. But the problem is not gone: when I later scale up a workload (or a new pod is launched by, e.g., a cronjob), these new pods will often be stuck again.
For example, here the pods of a cronjob (running once an hour) have not started for more than two hours because of this problem, and scaling up the "meme" workload to 4 instances leaves the new pod no. 3 stuck again:

[...]
some-cronjob-27720780-ltkmt   0/1     ContainerCreating   0              133m   <none>         ip-10-0-147-174.eu-central-1.compute.internal   <none>           <none>
some-cronjob-27720840-h9zcw   0/1     ContainerCreating   0              73m    <none>         ip-10-0-147-75.eu-central-1.compute.internal    <none>           <none>
some-cronjob-27720900-fgmmm   0/1     ContainerCreating   0              13m    <none>         ip-10-0-147-174.eu-central-1.compute.internal   <none>           <none>
[...]
meme-default-0                1/1     Running             0              3h1m   10.0.146.62    ip-10-0-145-28.eu-central-1.compute.internal    <none>           <none>
meme-default-1                1/1     Running             0              163m   10.0.144.232   ip-10-0-147-174.eu-central-1.compute.internal   <none>           <none>
meme-default-2                1/1     Running             0              160m   10.0.146.237   ip-10-0-147-75.eu-central-1.compute.internal    <none>           <none>
meme-default-3                0/1     Init:0/2            0              49m    <none>         ip-10-0-143-249.eu-central-1.compute.internal   <none>           <none>

Restarting the EFS CSI driver pods (both the efs-csi-node DaemonSet and the efs-csi-controller Deployment) sometimes seemed to help; currently it doesn't. Restarting all nodes fixed it temporarily, but the problem later returns.
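For reference, the restart described above boils down to a rollout restart of both workloads (a sketch; this assumes the driver is installed in the management namespace, as in the log commands above):

$ kubectl -n management rollout restart daemonset/efs-csi-node
$ kubectl -n management rollout restart deployment/efs-csi-controller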

I mentioned that we are observing this on only one cluster at the moment. What separates this cluster from the others is a high "workload churn": the cluster runs several deployments of our application in different namespaces, and these are refreshed (i.e. deleted and recreated) several times a day. This deletion includes the EFS-based volumes (we implicitly delete their PVCs by deleting the namespace). The storage class we use for dynamic provisioning has its reclaim policy set to Delete, so the PVs are also deleted, as are the associated EFS access points.
On most of our other clusters, we create deployments and then use them for a longer period of time, only performing minor changes (e.g., rolling out patches) while keeping the EFS volumes.
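For context, a storage class for dynamic provisioning of the kind described above looks roughly like this (a minimal sketch with placeholder values, not the actual configuration from this cluster):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
reclaimPolicy: Delete                  # deleting the PVC also removes the PV and the backing EFS access point
parameters:
  provisioningMode: efs-ap             # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0   # placeholder file system ID
  directoryPerms: "700"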

@mtavaresmedeiros

I am having the same problem.
EFS CSI driver version 1.4.2, cluster and nodes on Kubernetes 1.19.

@jgoeres
Did you find any solution?

Thanks

@BigbigY

BigbigY commented Oct 4, 2022

Me too

@ktleung2017

Try setting resource requests for the containers. I haven't seen this error for quite a while after adding them.
#325 (comment)
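For anyone trying this, the relevant snippet on each container of the efs-csi-node DaemonSet (and the controller Deployment) looks like the following; the values are only illustrative and match the pod dump posted further down in this thread:

resources:
  requests:
    cpu: 100m
    memory: 128Mi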

@sfc-gh-mkmak

We have experienced the same problem on one of our clusters with a high workload. We already have resource requests set up, but this doesn't help.
EFS driver version: 1.4.0
k8s version: 1.21

@gilbrown123

We have the same issue.
EKS: 1.21.14
EFS Driver Version 1.4.0

@Bruce-Lu674

We are hitting the same problem.
EKS Version : 1.21
EFS Driver Version 1.4.0

@RyanStan
Contributor

This issue might be resolved by upgrading to the latest driver version, v1.4.9. In v1.4.8, we fixed a concurrency issue with efs-utils that could cause this to happen.

If anyone runs into this again, can you please follow the troubleshooting guide to enable efs-utils debug logging, execute the log collector script, and then post any relevant errors from the mount.log file? This file contains the logs for efs-utils, which is doing the actual mounting "under the hood" of the csi driver.
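For anyone gathering these logs, the steps look roughly like the following sketch; the exact config keys and the log collector invocation should be double-checked against the troubleshooting guide for your driver version, and the pod name and namespace are placeholders:

# switch efs-utils to debug logging inside the node plugin container
# (assumes the config file contains a "logging_level = INFO" line, as recent efs-utils versions do)
$ kubectl exec -n <namespace> <efs-csi-node-pod> -c efs-plugin -- \
    sed -i 's/logging_level = INFO/logging_level = DEBUG/' /etc/amazon/efs/efs-utils.conf

# after reproducing the failed mount, pull the efs-utils mount log from the node pod
$ kubectl exec -n <namespace> <efs-csi-node-pod> -c efs-plugin -- cat /var/log/amazon/efs/mount.log

# or run the log collector script from the driver repository (invocation per the troubleshooting guide)
$ python3 log_collector.py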

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 30, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 30, 2023
@wmgroot

wmgroot commented Jun 20, 2023

I'm noticing this problem on EFS CSI v1.5.6.

Pod Event Error

  Warning  FailedAttachVolume  107s (x9 over 19m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-8a00b9f5-58e0-4e2d-a294-8a9c45e57a1a" : timed out waiting for external-attacher of efs.csi.aws.com CSI driver to attach volume fs-2a825351::fsap-0d75583a12ada3174

These are the log dumps from the log_collector.py tool.

driver_info

kubectl describe pod efs-csi-node-w4sl9 -n kube-system

Name:                 efs-csi-node-w4sl9
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      efs-csi-node-sa
Node:                 ip-10-116-161-48.us-east-2.compute.internal/10.116.161.48
Start Time:           Thu, 15 Jun 2023 20:05:42 -0500
Labels:               app=efs-csi-node
                      app.kubernetes.io/instance=efs-csi-awscmh2
                      app.kubernetes.io/name=aws-efs-csi-driver
                      controller-revision-hash=7dbf8cbdd4
                      pod-template-generation=7
Annotations:          apps.indeed.com/ship-logs: true
                      kubernetes.io/psp: privileged
                      vpaObservedContainers: efs-plugin, csi-driver-registrar, liveness-probe
                      vpaUpdates:
                        Pod resources updated by efs-csi-node: container 0: cpu request, memory request; container 1: cpu request, memory request; container 2: cp...
Status:               Running
IP:                   10.116.161.48
IPs:
  IP:           10.116.161.48
Controlled By:  DaemonSet/efs-csi-node
Containers:
  efs-plugin:
    Container ID:  containerd://13db8a2a7ac72c870487495ec95aa197767b056c2d65baab0a5be42b17a37cd1
    Image:         harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver:v1.5.6
    Image ID:      harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver@sha256:cba55174d2df13e9939a83b9d71e8b74f6a27ada2e44252ac80136e33a992d6e
    Port:          9809/TCP
    Host Port:     9809/TCP
    Args:
      --endpoint=$(CSI_ENDPOINT)
      --logtostderr
      --v=5
      --vol-metrics-opt-in=false
      --vol-metrics-refresh-period=240
      --vol-metrics-fs-rate-limit=5
    State:          Running
      Started:      Thu, 15 Jun 2023 20:05:48 -0500
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  128Mi
    Liveness:  http-get http://:healthz/healthz delay=10s timeout=3s period=2s #success=1 #failure=5
    Environment:
      CSI_ENDPOINT:  unix:/csi/csi.sock
    Mounts:
      /csi from plugin-dir (rw)
      /etc/amazon/efs-legacy from efs-utils-config-legacy (rw)
      /var/amazon/efs from efs-utils-config (rw)
      /var/lib/kubelet from kubelet-dir (rw)
      /var/run/efs from efs-state-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xw45q (ro)
  csi-driver-registrar:
    Container ID:  containerd://00a9ea19ed72327e5f808bd87a408f81629c5e86abc8e103773006308eba5f98
    Image:         public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.8.0-eks-1-27-3
    Image ID:      public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar@sha256:74e13dfff1d73b0e39ae5883b5843d1672258b34f7d4757995c72d92a26bed1e
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=$(ADDRESS)
      --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
      --v=5
    State:          Running
      Started:      Thu, 15 Jun 2023 20:05:49 -0500
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  128Mi
    Environment:
      ADDRESS:               /csi/csi.sock
      DRIVER_REG_SOCK_PATH:  /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock
      KUBE_NODE_NAME:         (v1:spec.nodeName)
    Mounts:
      /csi from plugin-dir (rw)
      /registration from registration-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xw45q (ro)
  liveness-probe:
    Container ID:  containerd://c9e7ab896df75b1249cbbf489adf8fe31d57e2caaf69d49b71a24c3a25858e39
    Image:         public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.10.0-eks-1-27-3
    Image ID:      public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe@sha256:25b4d3f9cf686ac464a742ead16e705da3adcfe574296dd75c5c05ec7473a513
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=/csi/csi.sock
      --health-port=9809
      --v=5
    State:          Running
      Started:      Thu, 15 Jun 2023 20:05:50 -0500
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
      memory:     128Mi
    Environment:  <none>
    Mounts:
      /csi from plugin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xw45q (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kubelet-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet
    HostPathType:  Directory
  plugin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins/efs.csi.aws.com/
    HostPathType:  DirectoryOrCreate
  registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins_registry/
    HostPathType:  Directory
  efs-state-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/efs
    HostPathType:  DirectoryOrCreate
  efs-utils-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/amazon/efs
    HostPathType:  DirectoryOrCreate
  efs-utils-config-legacy:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/amazon/efs
    HostPathType:  DirectoryOrCreate
  kube-api-access-xw45q:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:                      <none>


kubectl get pod efs-csi-node-w4sl9 -n kube-system -o yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    apps.indeed.com/ship-logs: "true"
    kubernetes.io/psp: privileged
    vpaObservedContainers: efs-plugin, csi-driver-registrar, liveness-probe
    vpaUpdates: 'Pod resources updated by efs-csi-node: container 0: cpu request,
      memory request; container 1: cpu request, memory request; container 2: cpu request,
      memory request'
  creationTimestamp: "2023-06-16T01:05:42Z"
  generateName: efs-csi-node-
  labels:
    app: efs-csi-node
    app.kubernetes.io/instance: efs-csi-awscmh2
    app.kubernetes.io/name: aws-efs-csi-driver
    controller-revision-hash: 7dbf8cbdd4
    pod-template-generation: "7"
  name: efs-csi-node-w4sl9
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: efs-csi-node
    uid: aa1527ec-97b6-498c-a21d-9a642d26c242
  resourceVersion: "2386821689"
  uid: eccdbf2a-3285-4adc-8ad2-c7ba68c33f02
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - ip-10-116-161-48.us-east-2.compute.internal
  containers:
  - args:
    - --endpoint=$(CSI_ENDPOINT)
    - --logtostderr
    - --v=5
    - --vol-metrics-opt-in=false
    - --vol-metrics-refresh-period=240
    - --vol-metrics-fs-rate-limit=5
    env:
    - name: CSI_ENDPOINT
      value: unix:/csi/csi.sock
    image: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver:v1.5.6
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /healthz
        port: healthz
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 2
      successThreshold: 1
      timeoutSeconds: 3
    name: efs-plugin
    ports:
    - containerPort: 9809
      hostPort: 9809
      name: healthz
      protocol: TCP
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/kubelet
      mountPropagation: Bidirectional
      name: kubelet-dir
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /var/run/efs
      name: efs-state-dir
    - mountPath: /var/amazon/efs
      name: efs-utils-config
    - mountPath: /etc/amazon/efs-legacy
      name: efs-utils-config-legacy
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xw45q
      readOnly: true
  - args:
    - --csi-address=$(ADDRESS)
    - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
    - --v=5
    env:
    - name: ADDRESS
      value: /csi/csi.sock
    - name: DRIVER_REG_SOCK_PATH
      value: /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock
    - name: KUBE_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.8.0-eks-1-27-3
    imagePullPolicy: IfNotPresent
    name: csi-driver-registrar
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /registration
      name: registration-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xw45q
      readOnly: true
  - args:
    - --csi-address=/csi/csi.sock
    - --health-port=9809
    - --v=5
    image: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.10.0-eks-1-27-3
    imagePullPolicy: IfNotPresent
    name: liveness-probe
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xw45q
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  nodeName: ip-10-116-161-48.us-east-2.compute.internal
  nodeSelector:
    kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 0
    runAsGroup: 0
    runAsNonRoot: false
    runAsUser: 0
  serviceAccount: efs-csi-node-sa
  serviceAccountName: efs-csi-node-sa
  terminationGracePeriodSeconds: 30
  tolerations:
  - operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/kubelet
      type: Directory
    name: kubelet-dir
  - hostPath:
      path: /var/lib/kubelet/plugins/efs.csi.aws.com/
      type: DirectoryOrCreate
    name: plugin-dir
  - hostPath:
      path: /var/lib/kubelet/plugins_registry/
      type: Directory
    name: registration-dir
  - hostPath:
      path: /var/run/efs
      type: DirectoryOrCreate
    name: efs-state-dir
  - hostPath:
      path: /var/amazon/efs
      type: DirectoryOrCreate
    name: efs-utils-config
  - hostPath:
      path: /etc/amazon/efs
      type: DirectoryOrCreate
    name: efs-utils-config-legacy
  - name: kube-api-access-xw45q
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-06-16T01:05:42Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-06-16T01:05:51Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-06-16T01:05:51Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-06-16T01:05:42Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://00a9ea19ed72327e5f808bd87a408f81629c5e86abc8e103773006308eba5f98
    image: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.8.0-eks-1-27-3
    imageID: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar@sha256:74e13dfff1d73b0e39ae5883b5843d1672258b34f7d4757995c72d92a26bed1e
    lastState: {}
    name: csi-driver-registrar
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-06-16T01:05:49Z"
  - containerID: containerd://13db8a2a7ac72c870487495ec95aa197767b056c2d65baab0a5be42b17a37cd1
    image: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver:v1.5.6
    imageID: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver@sha256:cba55174d2df13e9939a83b9d71e8b74f6a27ada2e44252ac80136e33a992d6e
    lastState: {}
    name: efs-plugin
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-06-16T01:05:48Z"
  - containerID: containerd://c9e7ab896df75b1249cbbf489adf8fe31d57e2caaf69d49b71a24c3a25858e39
    image: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.10.0-eks-1-27-3
    imageID: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe@sha256:25b4d3f9cf686ac464a742ead16e705da3adcfe574296dd75c5c05ec7473a513
    lastState: {}
    name: liveness-probe
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-06-16T01:05:50Z"
  hostIP: 10.116.161.48
  phase: Running
  podIP: 10.116.161.48
  podIPs:
  - ip: 10.116.161.48
  qosClass: Burstable
  startTime: "2023-06-16T01:05:42Z"

driver_logs

kubectl logs efs-csi-node-w4sl9 -n kube-system efs-plugin

I0616 01:05:48.928661       1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0616 01:05:48.929567       1 metadata.go:63] getting MetadataService...
I0616 01:05:48.931589       1 metadata.go:68] retrieving metadata from EC2 metadata service
I0616 01:05:48.932454       1 cloud.go:137] EFS Client created using the following endpoint: https://elasticfilesystem.us-east-2.amazonaws.com
I0616 01:05:48.932478       1 driver.go:84] Node Service capability for Get Volume Stats Not enabled
I0616 01:05:48.932588       1 driver.go:140] Did not find any input tags.
I0616 01:05:48.932739       1 driver.go:113] Registering Node Server
I0616 01:05:48.932752       1 driver.go:115] Registering Controller Server
I0616 01:05:48.932758       1 driver.go:118] Starting efs-utils watchdog
I0616 01:05:48.932833       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I0616 01:05:48.932846       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I0616 01:05:48.933148       1 driver.go:124] Starting reaper
I0616 01:05:48.933167       1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0616 01:05:50.285468       1 node.go:306] NodeGetInfo: called with args 

efs_utils_logs (something seems wrong here)

kubectl exec efs-csi-node-w4sl9 -n kube-system -c efs-plugin -- find /var/log/amazon/efs -type f -exec echo {} \; -exec cat {} \; -exec echo \;

find: 'echo': No such file or directory
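The find error here most likely just means there is no standalone echo binary on the efs-plugin image's PATH; because the failing "-exec echo" predicate short-circuits the chained "-exec cat", no log content is printed even if mount.log exists. A sketch of an alternative, assuming a POSIX shell is available in the image:

$ kubectl exec efs-csi-node-w4sl9 -n kube-system -c efs-plugin -- \
    sh -c 'for f in $(find /var/log/amazon/efs -type f); do printf "%s\n" "$f"; cat "$f"; printf "\n"; done'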

efs_utils_state_dir

kubectl exec efs-csi-node-w4sl9 -n kube-system -c efs-plugin -- find /var/run/efs -type f -exec echo {} \; -exec cat {} \; -exec echo \;

mounts

kubectl exec efs-csi-node-w4sl9 -n kube-system -c efs-plugin -- mount |grep nfs 

@wmgroot

wmgroot commented Jun 22, 2023

After further digging in our case, we noticed that the CSIDriver resource was missing in the cluster where the problem above was occurring. We have no idea why it's missing, but manually recreating it caused the controller to start working again.

This doesn't seem to be the first time an issue with the CSIDriver resource was noticed during a helm upgrade.
#325 (comment)
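For anyone hitting the external-attacher timeout shown above, the check and the re-creation might look like this (a sketch; compare the spec against the manifest shipped with your chart version before applying it):

$ kubectl get csidriver efs.csi.aws.com

$ kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  attachRequired: false
EOF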

@woehrl01

woehrl01 commented Jul 25, 2023

@wmgroot I just experienced the same issue. Are you using ArgoCD? I'm still debugging the behaviour, but I can reproduce a "Delete CSIDriver" diff.

I believe it's related to how Helm hooks are used in the chart for that resource and how ArgoCD handles them.

[Screenshot, 2023-07-25 19:54: ArgoCD diff showing the CSIDriver deletion]
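A quick way to check whether hook handling is involved is to inspect the annotations on the live CSIDriver object (a sketch; the exact annotation keys depend on the chart and ArgoCD versions):

# look for helm.sh/hook or argocd.argoproj.io/* annotations in metadata
$ kubectl get csidriver efs.csi.aws.com -o yaml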

@wmgroot

wmgroot commented Jul 25, 2023

We are using ArgoCD to manage our EFS CSI installation, yes.
We check our Argo diffs as part of our upgrade process and I do not remember seeing a deletion of the CSIDriver, but it's possible that we missed this during a previous upgrade or I wasn't paying enough attention.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jan 20, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
