Pods mounting EFS-CSI-driver-based volumes stuck in ContainerCreating for a long time because EFS volumes fail to mount (kubelet error "Unable to attach or mount volumes" [...] "timed out waiting for the condition") #765
Comments
I am having the same problem. @jgoeres Thanks |
Me too |
Try setting resource requests for the containers. I haven't seen this error for quite a while after adding them.
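For illustration only, a minimal sketch of that suggestion; the pod, image, PVC name, and values are placeholders, not from this thread:

```sh
# Hypothetical pod mounting an EFS-backed PVC, with explicit resource
# requests per the suggestion above. All names and values are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: efs-claim   # placeholder PVC name
EOF
```
|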
We have experienced the same problem on one of our clusters with a high workload. We already have resource requests set up, but this doesn't help. |
We have the same issue. |
Seeing the same problem here. |
This issue might be resolved by upgrading to the latest driver version, v1.4.9. In v1.4.8, we fixed a concurrency issue with efs-utils that could cause this to happen. If anyone runs into this again, can you please follow the troubleshooting guide to enable efs-utils debug logging, execute the log collector script, and then post any relevant errors from the logs?
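For reference, a sketch of that upgrade via the Helm chart; the repo and release names below are the documented defaults, so adjust them to your install:

```sh
# Pick up the efs-utils concurrency fix that landed in v1.4.8.
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system \
  --set image.tag=v1.4.9
```
|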
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
I'm noticing this problem on EFS CSI v1.5.6. Attaching the pod event error and the dumps from the log_collector.py tool: driver_info, driver_logs, efs_utils_logs (something seems wrong here), efs_utils_state_dir, and mounts. |
After further digging in our case, we noticed that the CSIDriver resource was missing in the cluster where the problem above was occurring. We have no idea why it's missing, but manually recreating it caused the controller to start working again. This doesn't seem to be the first time an issue with the CSIDriver resource has been noticed during a helm upgrade.
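If someone else hits this, a sketch of the check and the manual recreation; the spec below mirrors what I believe the chart installs (attachRequired: false, since EFS mounts don't go through the attach phase), so double-check it against your chart version:

```sh
# Is the CSIDriver object present at all?
kubectl get csidriver efs.csi.aws.com

# If not, recreate it. Verify this spec against the chart version you run.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  attachRequired: false
EOF
```
|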
@wmgroot I just experienced the same issue. Are you using ArgoCD? I'm still debugging the behaviour, but I can reproduce a "Delete CSIDriver" diff. I believe it's related to how Helm hooks are used in the chart for that resource and how ArgoCD handles them. |
We are using ArgoCD to manage our EFS CSI installation, yes.
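If it helps anyone debugging the same diff: the hypothesis above is that the chart marks the CSIDriver as a Helm hook, and ArgoCD maps Helm hooks onto its own sync hooks instead of treating the object as a plain manifest. One way to see what is actually on the live object; the hook annotations in the comments are an assumption based on some chart versions, not confirmed here:

```sh
# Dump the live object and look for helm.sh/hook* annotations, e.g.
#   helm.sh/hook: pre-install,pre-upgrade
#   helm.sh/hook-delete-policy: before-hook-creation
# (assumed from some chart versions; yours may differ)
kubectl get csidriver efs.csi.aws.com -o yaml
```
|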
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Hi,
we are using the EFS CSI driver (currently version 1.3.2) to provision EFS-based volumes to our workloads.
One of our clusters is currently suffering from a situation where freshly deployed pods that mount such volumes are stuck in ContainerCreating (resp. "Init:0/" for pods with init containers) for a very long time. Pods that are part of the same workload but do not mount EFS volumes are not affected, so it is 99.9% related to the EFS CSI driver.
This is how the (somewhat anonymized) workload presents itself when it is in that stuck state:
As an example, these are the events for the pod meme-default-2 while the pod is in this state (note that the volume that does attach immediately without problems is an EBS volume, handled by the EBS-CSI driver):
Note that in this example, the cluster autoscaler did perform a scale-up, but the issue also occurs on pods scheduled on already existing nodes. So I don't think that the autoscaler is involved in the problem.
The EFS CSI node pod on the node where the above pod is scheduled logs no obvious errors (at least to someone not familiar with the inner workings of the EFS CSI driver).
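For completeness, this is roughly how I pull the events and node-pod logs quoted above (assuming the driver runs in kube-system with the chart's default labels and container names):

```sh
# Events for the stuck pod from the example above.
kubectl describe pod meme-default-2

# Logs of the EFS CSI node pod on the same node; the label selector and
# container name assume the chart defaults.
NODE=$(kubectl get pod meme-default-2 -o jsonpath='{.spec.nodeName}')
kubectl logs -n kube-system -c efs-plugin \
  "$(kubectl get pods -n kube-system -l app=efs-csi-node \
       --field-selector spec.nodeName="$NODE" -o name)"
```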
Eventually, the attaching/mounting of the EFS volumes will succeed; this can take 10-15 minutes, but sometimes hours.
Usually, when the mounting works, it will work for all pods that are currently stuck. But the problem is not gone: when I later scale up a workload (or have a new pod launched by, e.g., a cronjob), these new pods will often be stuck again.
For example, here we have the pods of a cronjob (running once an hour) not being scheduled for more than two hours because of this problem. Scaling up the "meme" workload to 4 instances leaves the new pod No. 3 stuck again:
Restarting the EFS CSI driver pods (both the efs-csi-node DaemonSet and the efs-csi-controller Deployment) sometimes seemed to help; currently it doesn't. Restarting all nodes temporarily fixed it, but the problem later recurs.
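These are the restarts in question (assuming the default resource names and a kube-system install):

```sh
kubectl rollout restart daemonset/efs-csi-node -n kube-system
kubectl rollout restart deployment/efs-csi-controller -n kube-system
```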
I mentioned that we are observing this in only one cluster at this time. What separates this cluster from the others is its high "workload churn": the cluster runs several deployments of our application in different namespaces, which are refreshed (i.e., deleted and recreated) several times a day. This deletion includes the EFS-based volumes (we implicitly delete their PVCs by deleting the namespace). The storage class we use for dynamic provisioning has its reclaim policy set to Delete, so the PVs are also deleted, as are the associated EFS access points.
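For context, a sketch of the kind of StorageClass in play; the filesystem ID is a placeholder and the parameters follow the driver's documented dynamic-provisioning (access point) mode:

```sh
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
reclaimPolicy: Delete          # deleting the PVC deletes the PV and its access point
parameters:
  provisioningMode: efs-ap     # one EFS access point per provisioned volume
  fileSystemId: fs-0123456789abcdef0   # placeholder
  directoryPerms: "700"
EOF
```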
On most of our other clusters, we create deployments and then use them for a longer period of time, only performing minor changes (e.g., rollout patches), but keeping the EFS volumes.