Ignore 'wait: no child processes' error when calling mount/umount #103780
@@ -267,6 +273,10 @@ func (mounter *Mounter) Unmount(target string) error {
	command := exec.Command("umount", target)
	output, err := command.CombinedOutput()
	if err != nil {
		if err.Error() == errNoChildProcesses {
			// We don't consider this an error (see - k/k issue #103753)
			return nil
How do we make sure the unmount actually succeeded at this point?
Good catch.
So exec's CombinedOutput() essentially calls Run(), which does this:
func (c *Cmd) Run() error {
	if err := c.Start(); err != nil {
		return err
	}
	return c.Wait()
}
This error can only come from c.Wait(), which means c.Start() went through without any errors. Within c.Wait(), we run the following code:
func (c *Cmd) Wait() error {
	if c.Process == nil {
		return errors.New("exec: not started")
	}
	if c.finished {
		return errors.New("exec: Wait was already called")
	}
	c.finished = true
	state, err := c.Process.Wait()
	if c.waitDone != nil {
		close(c.waitDone)
	}
	c.ProcessState = state
	var copyError error
	for range c.goroutine {
		if err := <-c.errch; err != nil && copyError == nil {
			copyError = err
		}
	}
	c.closeDescriptors(c.closeAfterWait)
	if err != nil {
		return err
	} else if !state.Success() {
		return &ExitError{ProcessState: state}
	}
	return copyError
}
And this errNoChildProcesses (i.e. the ECHILD error) comes specifically from the line state, err := c.Process.Wait(), which internally makes an OS syscall. But Wait() returns that error before it gets to the error for the process exit state:
	if err != nil {
		return err
	} else if !state.Success() {
		return &ExitError{ProcessState: state}
	}
So if we additionally check that c.ProcessState reports success, then we should be good. Like this:
	if err.Error() == errNoChildProcesses && command.ProcessState.Success() {
		// We don't consider this an error (see - k/k issue #103753)
		return nil
	}
Does the above sound good to you?
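To make the distinction above concrete (an error from the wait call itself versus an error derived from the exit state), here is a minimal sketch of how the exit state surfaces in the ordinary case. The helper name runAndClassify is hypothetical and not part of the PR, and the sketch assumes a POSIX `sh` is on the PATH. It does not reproduce the ECHILD race; it only shows that a non-zero exit is reported as an *exec.ExitError and that cmd.ProcessState carries the exit code.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runAndClassify (hypothetical helper) runs a command and reports the exit
// code recorded in cmd.ProcessState, plus whether the error returned by Run
// is an *exec.ExitError (i.e. derived from the exit state).
func runAndClassify(name string, args ...string) (exitCode int, isExitErr bool) {
	cmd := exec.Command(name, args...)
	err := cmd.Run()

	var exitErr *exec.ExitError
	isExitErr = errors.As(err, &exitErr)

	// ProcessState is only populated once the process has been waited on.
	if cmd.ProcessState == nil {
		return -1, isExitErr
	}
	return cmd.ProcessState.ExitCode(), isExitErr
}

func main() {
	code, isExitErr := runAndClassify("sh", "-c", "exit 3")
	fmt.Println(code, isExitErr)
}
```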
I've actually updated it to a slightly better version that exposes the actual process error when we get errNoChildProcesses:
	if err.Error() == errNoChildProcesses {
		if command.ProcessState.Success() {
			// We don't consider errNoChildProcesses an error if the process itself succeeded (see - k/k issue #103753).
			return nil
		}
		// Rewrite err with the actual exit error of the process.
		err = &exec.ExitError{ProcessState: command.ProcessState}
	}
8e9ad0a to 6d988ac
From the issue description, the impact appears to be that mount/unmount could take up to 1s longer on heavily loaded nodes. I believe this exec race has also existed for many releases, so it is not considered a regression. @shyamjvs please correct me if I am mistaken. For now, we can target 1.23 for the fix. /triage accepted
@msau42 Yes, that's right.
Sounds good
@@ -176,6 +178,14 @@ func (mounter *Mounter) doMount(mounterPath string, mountCmd string, source stri
	command := exec.Command(mountCmd, mountArgs...)
	output, err := command.CombinedOutput()
	if err != nil {
		if err.Error() == errNoChildProcesses {
			if command.ProcessState.Success() {
Are you able to test it with this scenario?
I observed this on a customer prod cluster and couldn't reproduce it on my end so far. Seems to depend on the volume usage pattern and pod churn on their nodes.
Btw, rather than comparing strings, I wonder if we could have unwrapped the error and looked specifically for the syscall.ECHILD error?
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jsafrane, shyamjvs The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…ng mount/umount Based on github.com/kubernetes/kubernetes/pull/103780. This PR is not in the CSI driver repo yet, marking as <carry>. To be carried in OCP until the EFS CSI driver upstream updates k8s.io/mount-utils v1.23
…80-upstream-release-1.22 Automated cherry pick of #103780: Ignore 'wait: no child processes' error when calling
We want kubernetes/kubernetes#103780 in the EFS CSI driver to fix https://bugzilla.redhat.com/show_bug.cgi?id=2056629 We need to carry it until upstream updates to k8s 1.23 or newer. This replaces 7caacdb: UPSTREAM: <carry>: Ignore 'wait: no child processes' error when calling mount/umount
Fixes #103753
I've only fixed the exec commands that are part of the Mount() and Unmount() functions, and only in the Linux mount helper. I haven't touched the others, since I'm not sure about the implications. Let me know if I should.
/kind bug
/sig storage
/sig scalability
/assign @liggitt