For a while now, a strange behavior has been lurking in our infrastructure. At seemingly random intervals, some Pods get stuck in "ContainerCreating" status and never start. As far as we can tell, this happens at random on different nodes in the Kubernetes cluster. Notably, all Kubernetes nodes are essentially replicas of the same Ubuntu installation, which only makes things weirder: every node runs the exact same versions of installed packages and kernel.

The only similar issue we found was here: Azure/kubernetes-volume-drivers#45

A `kubectl describe pods` shows the error message:
```
Warning  FailedMount  15m (x16 over 33m)  kubelet  MountVolume.SetUp failed for volume "***" : mount command failed, status: Failure, reason: Error: Error running cmd [cmd=/bin/mount -t cifs -o rw,username=***,password=***,uid=106,vers=3.02 //*** /var/lib/kubelet/pods/aabac208-086e-4e06-bcef-dc94594fe406/volumes/juliohm~cifs/***] [response=mount error(112): Host is down
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)
```
Initial investigation shows the remote Samba host is not, in fact, down. It is up and serving other clients normally. A great number of search results point toward a protocol version mismatch between client and server.

Example: https://serverfault.com/questions/414074/mount-cifs-host-is-down
For most people, changing the protocol version in the mount options fixes the problem on the client side. In our case, however, it only works temporarily: after a while, the problem is back with the same error message. Changing back to the previous protocol version also works, but again only for a while, so that is not a solution. All the while, other Kubernetes nodes are able to mount the same volume and start replicas of the same Pod without any issues.
Further investigation also reveals log messages from the Windows Samba host. Under Windows Event Viewer, it is possible to see the server refusing to accept a large number of connection attempts with the error message:
```
The server denied anonymous access to the client
```
On the client side (the Kubernetes nodes), the same mount command can be issued manually from a separate, independent shell as root:
```
# /bin/mount -t cifs -o rw,username=***,password=***,uid=106,vers=3.02 //*** /tmp/mnt
mount error(112): Host is down
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)
```
A tcpdump capture reveals another interesting fact: the Linux client issuing this mount command is not sending any credentials at all, regardless of the valid credentials passed on the command line. That matches the "anonymous access" errors logged on the Windows host.
Finally, running `ps aux | grep mount`, we discovered a mount instance hanging in memory: one of the mount commands from a previous Pod had never terminated and had been left hanging for several days. Only after killing that particular mount instance with `kill -9` did subsequent mount commands begin to work again.
The root cause of this behavior is most likely a bug. Somewhere between the mount executable and the kernel CIFS module, something hits an unpredictable state. What we can confirm is that a problematic mount instance left hanging eventually breaks subsequent invocations of the same program.
This issue documents a PR being developed for this project. Instead of simply running the mount command, the driver will start that process under a deadline context. After a default 60s, the mount instance will be killed and the driver will return an error. This should prevent long-running instances of the mount command from lingering for unknown reasons. The timeout will be adjustable via the Pod's volume YAML configuration.
A beta version of the modified driver has been running in our infrastructure without issues. So far, no more hanging mount commands have appeared.