For a while now, a strange behavior has been lurking in our infrastructure. At seemingly random intervals, some Pods get stuck in "ContainerCreating" status and never start. As far as we can tell, this happens at random on different nodes in the Kubernetes cluster. Notably, all Kubernetes nodes are essentially replicas of the same Ubuntu installation, which only makes things weirder: every node runs the exact same versions of installed packages and kernel.

The only similar issue we found was here: Azure/kubernetes-volume-drivers#45

A `kubectl describe pods` shows the error message:
```
Warning  FailedMount  15m (x16 over 33m)  kubelet  MountVolume.SetUp failed for volume "***" : mount command failed, status: Failure, reason: Error: Error running cmd [cmd=/bin/mount -t cifs -o rw,username=***,password=***,uid=106,vers=3.02 //*** /var/lib/kubelet/pods/aabac208-086e-4e06-bcef-dc94594fe406/volumes/juliohm~cifs/***] [response=mount error(112): Host is down
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)
```
Initial investigation shows the remote Samba host is not, in fact, down. It is up and serving other clients normally. A great number of search results point toward a protocol version mismatch between client and server.

Example: https://serverfault.com/questions/414074/mount-cifs-host-is-down
For most people, changing the protocol version in the mount options fixes the problem on the client side. In our case, however, it only works temporarily: after a while, the problem is back with the same error message. Changing back to the previous protocol version also works, but again only for a while, so that is not a solution. All the while, other Kubernetes nodes are able to mount the same volume and start replicas of the same Pod without any issues.
Further investigation also reveals log messages from the Windows Samba host. Under Windows Event Viewer, it is possible to see the server refusing to accept a large number of connection attempts with the error message:
```
The server denied anonymous access to the client
```
On the client side (the Kubernetes nodes), the same mount command can be issued manually from a separate, independent shell as root:
```
# /bin/mount -t cifs -o rw,username=***,password=***,uid=106,vers=3.02 //*** /tmp/mnt
mount error(112): Host is down
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)
```
A tcpdump capture reveals another interesting fact: the Linux client issuing this mount command is not sending any credentials at all, regardless of the valid credentials passed on the command line. That matches the "anonymous access" errors logged on the Windows host.
Finally, running `ps aux | grep mount`, we discovered a mount instance hanging in memory: one of the mount commands from a previous Pod had never terminated and had been left hanging for several days. Only after killing that particular mount instance with `kill -9` did subsequent mount commands begin to work again.
The root cause of this behavior is most likely a bug. Somewhere between the mount executable and the kernel CIFS module, something hits an unpredictable state. What we can confirm is that a problematic mount instance left hanging eventually breaks subsequent invocations of the same program.
This issue documents a PR being developed for this project. Instead of simply running the mount command, the driver will start that process under a deadline context. After a default 60s, the mount instance will be killed and the driver will return an error. This should prevent long-running instances of the mount command from lingering for unknown reasons. The timeout will be adjustable via the Pod's volume YAML configuration.
A beta version of the modified driver has been running in our infrastructure without issues. So far, no more hanging mount commands have appeared.