
While a mount command hangs indefinitely, subsequent mounts terminate with the error "Host is down" #25

Closed
juliohm1978 opened this issue Nov 19, 2020 · 1 comment
Labels
enhancement New feature or request



juliohm1978 commented Nov 19, 2020

For a while now, a strange behavior has been lurking in our infrastructure. At random intervals, some Pods get stuck in "ContainerCreating" status and never start. As far as we can tell, this happens at unpredictable times on different nodes in the Kubernetes cluster. Notably, all Kubernetes nodes are essentially replicas of the same Ubuntu installation, which only makes things weirder: every node runs exactly the same versions of the installed packages and kernel.

The only similar issue we found was here: Azure/kubernetes-volume-drivers#45

Running kubectl describe pod on a stuck Pod shows the error message:

Warning  FailedMount  15m (x16 over 33m)  kubelet  MountVolume.SetUp failed for volume "***" : mount command failed, status: Failure, reason: Error: Error running cmd [cmd=/bin/mount -t cifs -o rw,username=***,password=***,uid=106,vers=3.02 //*** /var/lib/kubelet/pods/aabac208-086e-4e06-bcef-dc94594fe406/volumes/juliohm~cifs/***] [response=mount error(112): Host is down
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)

Initial investigation shows the remote Samba host is not down; it is up and serving other clients normally. A great number of Google search results point toward a protocol version mismatch between client and server.

Example: https://serverfault.com/questions/414074/mount-cifs-host-is-down

For most people, changing the protocol version in the mount options fixes the problem on the client side. In our case, however, it only works temporarily: after a while, the problem returns with the same error message. Changing back to the previous protocol version also works, but again only for a while, so that is not a solution. All the while, other Kubernetes nodes are able to mount the same volume and start replicas of the same Pod without any issues.

Further investigation also reveals log messages from the Windows Samba host. Under Windows Event Viewer, it is possible to see the server refusing to accept a large number of connection attempts with the error message:

The server denied anonymous access to the client

[Screenshot: Windows Event Viewer entry showing the denied anonymous access error]

On the client side (the Kubernetes nodes), the same mount command can be issued manually in a separate, independent shell as root:

# /bin/mount -t cifs -o rw,username=***,password=***,uid=106,vers=3.02 //*** /tmp/mnt
mount error(112): Host is down
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)

A tcpdump reveals another interesting fact: the Linux client issuing this mount command is not sending any credentials at all, regardless of the valid command-line arguments.

Finally, running ps aux | grep mount, we discovered a mount instance hanging in memory: a mount command from a previous Pod had not terminated and had been left hanging for several days. Only after killing that particular mount instance with kill -9 did subsequent mount commands start working again.

The root cause of this behavior is most likely a bug. Somewhere between the mount executable and the kernel CIFS module, something hits an unpredictable situation. We can only confirm that a problematic mount instance left hanging eventually causes failures in other instances of the same program.

This issue is being created here to document a PR under development for this project. Instead of simply running the mount command, the driver will start that process with a deadline. After a default 60 seconds, the mount instance will be killed and the driver will return an error. This should prevent mount commands from lingering indefinitely for unknown reasons. The timeout will be adjustable via the Pod's volume YAML configuration.
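The idea can be sketched as follows. This is a minimal Python illustration of running a child process under a deadline, not the driver's actual code; the function name and the stand-in command are hypothetical:

```python
import subprocess

def mount_with_deadline(mount_args, timeout=60):
    """Run a mount command, killing it if it exceeds `timeout` seconds.

    Returns (exit_code, combined_output). A timeout yields a nonzero
    exit code so the caller can report a mount failure instead of
    hanging forever.
    """
    try:
        result = subprocess.run(
            mount_args, capture_output=True, text=True, timeout=timeout
        )
        return result.returncode, result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process before raising,
        # so no lingering mount instance is left behind.
        return 1, f"mount timed out after {timeout}s"

# Demo with a stand-in command that hangs (sleep instead of /bin/mount):
rc, out = mount_with_deadline(["sleep", "300"], timeout=1)
```

The key property is that the child is killed on timeout rather than orphaned, which is exactly the lingering-process situation described above.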

A beta version of the modified driver has been running in our infrastructure without issues. So far, no more hanging mount commands have appeared.

juliohm/kubernetes-cifs-volumedriver-installer:2.4-beta
juliohm1978 (Owner, Author) commented:

Won't fix. The issue does not seem related to the volume driver; it also appears in other implementations.

Azure/kubernetes-volume-drivers#45

Using more explicit arguments in the mount options seems to work around the issue:

sec=ntlmv2,vers=3.0
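Applied to the command from the original report, the workaround would look something like this (share path and credentials elided as in the original):

```shell
/bin/mount -t cifs -o rw,username=***,password=***,uid=106,sec=ntlmv2,vers=3.0 //*** /tmp/mnt
```

Pinning sec= alongside vers= avoids leaving the security mode to client/server negotiation, which appears to be where the mismatch occurs.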
