Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix smb unmount issue on Windows #75087

Merged
merged 1 commit into from Mar 12, 2019

Conversation

@andyzhangx
Copy link
Member

commented Mar 7, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fix following issue:
pod with smb(cifs) mount could not be terminated on Windows when that smb account is invalid, the pod will be stuck forever

This PR is another windows fix related to #74625

Which issue(s) this PR fixes:

Fixes #75025

Special notes for your reviewer:
This PR implement IsCorruptedMnt func on Windows, unix code is actully unchanged.
The warnning logs in this PR would be like following in kubelet which is expected:

W0308 05:53:34.344557    2392 mount_helper_windows.go:52] IsCorruptedMnt failed with ERROR_LOGON_FAILURE error: CreateFile c:\var\lib\kubelet\pods\4fe70e4a-4166-11e9-bf93-000d3af95268\volumes\kubernetes.io~azure-file\pvc-4fe1243c-4166-11e9-bf93-000d3af95268: The user name or password is incorrect.
W0308 05:53:34.347757    2392 mount_helper_windows.go:52] IsCorruptedMnt failed with ERROR_LOGON_FAILURE error: CreateFile c:\var\lib\kubelet\pods\4fe70e4a-4166-11e9-bf93-000d3af95268\volumes\kubernetes.io~azure-file\pvc-4fe1243c-4166-11e9-bf93-000d3af95268: The user name or password is incorrect.
I0308 05:53:34.347757    2392 mount_helper_common.go:74] "c:\\var\\lib\\kubelet\\pods\\4fe70e4a-4166-11e9-bf93-000d3af95268\\volumes\\kubernetes.io~azure-file\\pvc-4fe1243c-4166-11e9-bf93-000d3af95268" is a mountpoint, unmounting
I0308 05:53:34.347757    2392 mount_windows.go:149] azureMount: Unmount target ("c:\\var\\lib\\kubelet\\pods\\4fe70e4a-4166-11e9-bf93-000d3af95268\\volumes\\kubernetes.io~azure-file\\pvc-4fe1243c-4166-11e9-bf93-000d3af95268")

Does this PR introduce a user-facing change?:

fix smb unmount issue on Windows

/sig windows
/priority important-soon
/assign @jingxu97 @davidz627 @msau42

cc @feiskyer could you mark milestone of this PR, thanks.

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Mar 7, 2019

/test pull-kubernetes-cross

@k8s-ci-robot k8s-ci-robot requested review from jingxu97 and msau42 Mar 7, 2019

@feiskyer

This comment has been minimized.

Copy link
Member

commented Mar 7, 2019

/milestone v1.14

@k8s-ci-robot k8s-ci-robot added this to the v1.14 milestone Mar 7, 2019

return false
}

if strings.Contains(err.Error(), SMBConnectionError) {

This comment has been minimized.

Copy link
@feiskyer

feiskyer Mar 7, 2019

Member

Is "password is incorrect" the only case for corrupt? Are there any other possibilities?

This comment has been minimized.

Copy link
@andyzhangx

andyzhangx Mar 7, 2019

Author Member

good question, there must be other possiblities, while I don't know.
another fix is write like this which may have drawbacks, it depends on whether we use whiltelist or blacklist:

func IsCorruptedMnt(err error) bool {
    return err != nil
}

I think there are some errors still missing on Linux implementation since only following 4 errors caught in IsCorruptedMnt func:

func IsCorruptedMnt(err error) bool {
if err == nil {
return false
}
var underlyingError error
switch pe := err.(type) {
case nil:
return false
case *os.PathError:
underlyingError = pe.Err
case *os.LinkError:
underlyingError = pe.Err
case *os.SyscallError:
underlyingError = pe.Err
}
return underlyingError == syscall.ENOTCONN || underlyingError == syscall.ESTALE || underlyingError == syscall.EIO || underlyingError == syscall.EACCES

@jingxu97 @msau42 @davidz627 what do you think about this? why way do you prefer on Windows?

This comment has been minimized.

Copy link
@msau42

msau42 Mar 7, 2019

Member

invalid password seems like a strange scenario to consider as a "corrupted" mount. Is that a normal error that users may encounter? Is it recoverable?

This comment has been minimized.

Copy link
@PatrickLang

PatrickLang Mar 7, 2019

Contributor

I think these could also be returned

  • ERROR_NETWORK_ACCESS_DENIED
  • ERROR_BAD_NET_NAME
  • ERROR_NETNAME_DELETED

(reference list: https://docs.microsoft.com/en-us/windows/desktop/Debug/system-error-codes--0-499-)

This comment has been minimized.

Copy link
@neolit123

neolit123 Mar 7, 2019

Member

@andyzhangx

I think there are some errors still missing on Linux implementation since only following 4 errors caught in IsCorruptedMnt func

this command uses Stat() -> fstatat() on *unix
and i think the list of existing errors is odd.

here is what fstatat returns:
http://man7.org/linux/man-pages/man3/fstatat.3p.html


on Windows Stat() uses the stat() function here:
https://golang.org/src/os/stat_windows.go

so it will throw a PathError at least. i don't know where SMBConnectionError comes from.

in terms of Windows error codes @PatrickLang already listed some, i don't know the codes of a corrupted mklink / CreateSymbolicLink, but here is what you can do to check against them:

package main

import (
	"fmt"
	"golang.org/x/sys/windows"
	"syscall"
)

func main() {
	ptr, err := windows.UTF16PtrFromString("./some-missing-path")
	if err != nil {
		fmt.Println(err)
		return
	}
	attrib, err := windows.GetFileAttributes(ptr)
	if err != nil {
		fmt.Println(err.(syscall.Errno) == windows.ERROR_FILE_NOT_FOUND) // or any error code...
		return
	}
	fmt.Println(attrib)
}

This comment has been minimized.

Copy link
@neolit123

neolit123 Mar 8, 2019

Member

and then error info The user name or password is incorrect does not exists in types_windows.go

we can still verify against the integer (ERROR_LOGON_FAILURE, 1326)
https://docs.microsoft.com/en-us/windows/desktop/debug/system-error-codes--1300-1699-

e.g.

const ERROR_LOGON_FAILURE = 1326
... 
err.(syscall.Errno) == ERROR_LOGON_FAILURE

This comment has been minimized.

Copy link
@andyzhangx

andyzhangx Mar 8, 2019

Author Member

@neolit123 thanks, it's 1040, not in the list of https://docs.microsoft.com/en-us/windows/desktop/Debug/system-error-codes--1000-1299-, so shall I go with 1040 error checking?

This comment has been minimized.

Copy link
@neolit123

neolit123 Mar 8, 2019

Member

hm, i don't know what 1040 is, but probably best not to check for it.
golang just calls native functions and those follow only the existing error codes.

This comment has been minimized.

Copy link
@andyzhangx

andyzhangx Mar 8, 2019

Author Member

@neolit123 sorry, I was wrong, it's this error:

ERROR_LOGON_FAILURE
1326 (0x52E)
The user name or password is incorrect.

I think I have the correct solution now, I have this error check in the code first, thanks!

This comment has been minimized.

Copy link
@michmike

michmike Mar 8, 2019

Contributor

that's great that it matches the real error codes from windows

@andyzhangx andyzhangx force-pushed the andyzhangx:unmount-issue-windows branch from 251340f to 48aad50 Mar 7, 2019

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Mar 7, 2019

/test pull-kubernetes-cross

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Mar 7, 2019

/test pull-kubernetes-e2e-gce
/test pull-kubernetes-integration

@michmike michmike added this to 1.14 Release Blocking (Windows GA, gMSA alpha) in SIG-Windows Mar 7, 2019

@michmike

This comment has been minimized.

Copy link
Contributor

commented Mar 7, 2019

adding a hold until we identify all the proper mount failures on windows
/hold
@neolit123 can you also please review?

@michmike

This comment has been minimized.

Copy link
Contributor

commented Mar 7, 2019

cc: @benmoss for review as well

@PatrickLang

This comment has been minimized.

Copy link
Contributor

commented Mar 7, 2019

Just like #74625, this would also need backport to 1.11-1.13.

@PatrickLang

This comment has been minimized.

Copy link
Contributor

commented Mar 7, 2019

cc @KnicKnic - any feedback on identifying failed mounts?

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

thanks guys, I used errorno check as @neolit123 suggested, pushed a new commit: d0d49da

BTW, this issue is a common failure in SMB user scenario, for some reason(e.g. security consideration), user changed the password of SMB server, while the original SMB mount could not be terminated forever, on Linux, it has been fixed by adding one more check:

underlyingError == syscall.EACCES

return underlyingError == syscall.ENOTCONN || underlyingError == syscall.ESTALE || underlyingError == syscall.EIO || underlyingError == syscall.EACCES

While it is not fixed on Windows yet.

It looks like it's very close to code freeze date, can we make it into 1.14? @michmike

@michmike

This comment has been minimized.

Copy link
Contributor

commented Mar 8, 2019

thanks guys, I used errorno check as @neolit123 suggested, pushed a new commit: d0d49da

BTW, this issue is a common failure in SMB user scenario, for some reason(e.g. security consideration), user changed the password of SMB server, while the original SMB mount could not be terminated forever, on Linux, it has been fixed by adding one more check:

underlyingError == syscall.EACCES

kubernetes/pkg/util/mount/mount_helper.go

Line 123 in 788f245

return underlyingError == syscall.ENOTCONN || underlyingError == syscall.ESTALE || underlyingError == syscall.EIO || underlyingError == syscall.EACCES
While it is not fixed on Windows yet.

It looks like it's very close to code freeze date, can we make it into 1.14?

don't worry about code freeze. we need to fix this properly , test it and then we can see about 1.14

@michmike

This comment has been minimized.

Copy link
Contributor

commented Mar 8, 2019

i would like @adelina-t, @bclau to also have the fix run in the same environment where it failed before

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

i would like @adelina-t, @bclau to also have the fix run in the same environment where it failed before

This fix only applies for SMB unmount issue on Windows, happens when password changed, do we have any test failure on this before? @michmike

@michmike

This comment has been minimized.

Copy link
Contributor

commented Mar 9, 2019

i added some comments which you should resolve before removing the hold
/hold
/lgtm

@michmike

This comment has been minimized.

Copy link
Contributor

commented Mar 9, 2019

pinging the sig-storage team to take a look at this as well

@msau42

This comment has been minimized.

Copy link
Member

commented Mar 9, 2019

Are these all errors that when if encountered, the mount point could still be successfully unmounted and cleaned up?

To give some context, the unmount sequence is like:

  1. Check if directory is a mount point <-- this is where the IsCorruptedMount check is done
  2. If it is a mount, then call Unmount
  3. If unmount was successful and the directory is no longer a mount point, remove the directory.
fix smb unmount issue on Windows
fix log warning

use IsCorruptedMnt in GetMountRefs on Windows

use errorno in IsCorruptedMnt check

fix comments: add more error code

add more error no checking

change year

fix comments

@andyzhangx andyzhangx force-pushed the andyzhangx:unmount-issue-windows branch from 92469f2 to 720a5e2 Mar 10, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label Mar 10, 2019

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Mar 10, 2019

do all those errors map to the same one with 64 as the error code?

Partly correct, a lot of errors are mapping to 64, others are different error codes. @michmike

Are these all errors that when if encountered, the mount point could still be successfully unmounted and cleaned up?
To give some context, the unmount sequence is like:
Check if directory is a mount point <-- this is where the IsCorruptedMount check is done
If it is a mount, then call Unmount
If unmount was successful and the directory is no longer a mount point, remove the directory.

Yes, on Windows, all Unmount would be simply rmdir opertion, it will always succeed for those conditions(@msau42 ):

func (mounter *Mounter) Unmount(target string) error {
klog.V(4).Infof("azureMount: Unmount target (%q)", target)
target = normalizeWindowsPath(target)
if output, err := exec.Command("cmd", "/c", "rmdir", target).CombinedOutput(); err != nil {
klog.Errorf("rmdir failed: %v, output: %q", err, string(output))
return err

/hold cancel
/test pull-kubernetes-cross

@michmike

This comment has been minimized.

Copy link
Contributor

commented Mar 10, 2019

/approve

@michmike

This comment has been minimized.

Copy link
Contributor

commented Mar 10, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 10, 2019

@michmike

This comment has been minimized.

Copy link
Contributor

commented Mar 10, 2019

@msau42 if this looks good to you as well, please approve and we will go through the motions of getting this approved for 1.14

@michmike

This comment has been minimized.

Copy link
Contributor

commented Mar 10, 2019

/assign @jingxu97

@jingxu97

This comment has been minimized.

Copy link
Contributor

commented Mar 11, 2019

I have a questions of what is considered as corrupted mount point? In what conditions we should clean them up? For example, for error STATUS_SERVER_UNAVAILABLE, is that possible that server is just temporally unavailable, do we want to clean up the mount in this situation?

@jingxu97

This comment has been minimized.

Copy link
Contributor

commented Mar 11, 2019

Talked with @msau42 and @saad-ali, for this PR, adding more error codes should be safe. If it is considered corrected mount point, it will move to the next step trying to unmount and then check mountpoint before removing files and directories.

But the whole process of checking correpted mount point before trying to unmount and remove files could be simplified. I propose to call umount directly, if it fails, checks whether the path is a mount point or not, if not, delete the files and directories. Please see #75273

@jingxu97

This comment has been minimized.

Copy link
Contributor

commented Mar 11, 2019

/approve

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Mar 11, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, jingxu97, michmike

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 43b6ddf into kubernetes:master Mar 12, 2019

17 checks passed

cla/linuxfoundation andyzhangx authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details

SIG-Windows automation moved this from 1.14 Release Blocking (Windows GA, gMSA alpha) to Done (v.1.14) Mar 12, 2019

k8s-ci-robot added a commit that referenced this pull request Apr 18, 2019

Merge pull request #75359 from andyzhangx/automated-cherry-pick-of-#7…
…5087-upstream-release-1.12

Automated cherry pick of #75087: fix smb unmount issue on Windows

k8s-ci-robot added a commit that referenced this pull request May 7, 2019

Merge pull request #75358 from andyzhangx/automated-cherry-pick-of-#7…
…5087-upstream-release-1.13

Automated cherry pick of #75087: fix smb unmount issue on Windows
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.