Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore 'wait: no child processes' error when calling mount/umount #103780

Merged
merged 1 commit into from Aug 5, 2021

Conversation

shyamjvs
Copy link
Member

@shyamjvs shyamjvs commented Jul 19, 2021

Fixes #103753

I've only fixed the exec commands that are part of Mount() and Unmount() functions and that too in the linux mount helper. Not touching others, since I'm not sure about the implications. Let me know if I should.

/kind bug
/sig storage
/sig scalability

/assign @liggitt

NONE

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/storage Categorizes an issue or PR as relevant to SIG Storage. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 19, 2021
@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 19, 2021
@@ -267,6 +273,10 @@ func (mounter *Mounter) Unmount(target string) error {
command := exec.Command("umount", target)
output, err := command.CombinedOutput()
if err != nil {
if err.Error() == errNoChildProcesses {
// We don't consider this an error (see - k/k issue #103753)
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How we make sure this unmount must be successful at this point?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

So exec:CombinedOutput() essentially runs exec:Run() which does this:

func (c *Cmd) Run() error {
	if err := c.Start(); err != nil {
		return err
	}
	return c.Wait()
}

This error can only come from c.Wait() which mean c.Start() went through without any errors. Within c.Wait(), we run the following code:

func (c *Cmd) Wait() error {
	if c.Process == nil {
		return errors.New("exec: not started")
	}
	if c.finished {
		return errors.New("exec: Wait was already called")
	}
	c.finished = true

	state, err := c.Process.Wait()
	if c.waitDone != nil {
		close(c.waitDone)
	}
	c.ProcessState = state

	var copyError error
	for range c.goroutine {
		if err := <-c.errch; err != nil && copyError == nil {
			copyError = err
		}
	}

	c.closeDescriptors(c.closeAfterWait)

	if err != nil {
		return err
	} else if !state.Success() {
		return &ExitError{ProcessState: state}
	}

	return copyError
}

And this errNoChildProcesses (i.e ECHILD error) comes specifically from this line state, err := c.Process.Wait() which internally makes an os syscall. But we seem to be returning that error before returning the error for the process exit state:

	if err != nil {
		return err
	} else if !state.Success() {
		return &ExitError{ProcessState: state}
	}

So additionally if we check that c.ProcessState is a success, then we should be good. Like this:

if err.Error() == errNoChildProcesses && command.ProcessState.Success() {
      // We don't consider this an error (see - k/k issue #103753)
      return nil
}

Does the above sgty?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated it to a slightly better version actually to help expose the actual process error when we get errNoChildProcesses:

		if err.Error() == errNoChildProcesses {
			if command.ProcessState.Success() {
				// We don't consider errNoChildProcesses an error if the process itself succeeded (see - k/k issue #103753).
				return nil
			}
			// Rewrite err with the actual exit error of the process.
			err = &exec.ExitError{ProcessState: command.ProcessState}
		}

@liggitt
Copy link
Member

liggitt commented Jul 20, 2021

/unassign
/assign @jsafrane @msau42
for determination whether this is fixing a release-blocking issue

@k8s-ci-robot k8s-ci-robot assigned jsafrane and msau42 and unassigned liggitt Jul 20, 2021
@msau42
Copy link
Member

msau42 commented Jul 20, 2021

From the issue description, the impact appears to be that mount/unmount could take up to 1s longer on heavily loaded nodes. I believe this exec race has also existed for many releases so is not considered a regression. @shyamjvs please correct me if I am mistaken.

For now, we can target 1.23 for the fix.

/triage accepted
/milestone v1.23

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 20, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Jul 20, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 20, 2021
@shyamjvs
Copy link
Member Author

shyamjvs commented Jul 20, 2021

@msau42 Yes, that's right.

For now, we can target 1.23 for the fix.

Sounds good

@@ -176,6 +178,14 @@ func (mounter *Mounter) doMount(mounterPath string, mountCmd string, source stri
command := exec.Command(mountCmd, mountArgs...)
output, err := command.CombinedOutput()
if err != nil {
if err.Error() == errNoChildProcesses {
if command.ProcessState.Success() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you able to test it with this scenario?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I observed this on a customer prod cluster and couldn't reproduce it on my end so far. Seems to depend on the volume usage pattern and pod churn on their nodes.

Copy link
Member

@gnufied gnufied Aug 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw - rather than comparing strings I wonder if we could have unwrapped the error and looked for syscall.ECHILD error specifically?

@jingxu97
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 21, 2021
@jsafrane
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane, shyamjvs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 23, 2021
@k8s-ci-robot k8s-ci-robot merged commit f3e9fdc into kubernetes:master Aug 5, 2021
jsafrane added a commit to jsafrane/aws-efs-csi-driver that referenced this pull request Oct 18, 2021
…ng mount/umount

Based on github.com/kubernetes/kubernetes/pull/103780. This PR is not in
the CSI driver repo yet, marking as <carry>. To be carried in OCP until the
EFS CSI driver upstream updates k8s.io/mount-utils v1.23
wongma7 added a commit to wongma7/aws-efs-csi-driver that referenced this pull request Oct 19, 2021
k8s-ci-robot added a commit that referenced this pull request Nov 23, 2021
…80-upstream-release-1.22

Automated cherry pick of #103780: Ignore 'wait: no child processes' error when calling
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/aws-efs-csi-driver that referenced this pull request Feb 21, 2022
…ng mount/umount

Based on github.com/kubernetes/kubernetes/pull/103780. This PR is not in
the CSI driver repo yet, marking as <carry>. To be carried in OCP until the
EFS CSI driver upstream updates k8s.io/mount-utils v1.23
@shyamjvs shyamjvs deleted the fix-exec-command branch April 5, 2022 18:37
tsmetana pushed a commit to tsmetana/aws-efs-csi-driver that referenced this pull request Sep 22, 2022
…ng mount/umount

Based on github.com/kubernetes/kubernetes/pull/103780. This PR is not in
the CSI driver repo yet, marking as <carry>. To be carried in OCP until the
EFS CSI driver upstream updates k8s.io/mount-utils v1.23
RomanBednar pushed a commit to RomanBednar/aws-efs-csi-driver that referenced this pull request Oct 26, 2022
…ng mount/umount

Based on github.com/kubernetes/kubernetes/pull/103780. This PR is not in
the CSI driver repo yet, marking as <carry>. To be carried in OCP until the
EFS CSI driver upstream updates k8s.io/mount-utils v1.23
tsmetana pushed a commit to tsmetana/aws-efs-csi-driver that referenced this pull request Dec 16, 2022
…ng mount/umount

Based on github.com/kubernetes/kubernetes/pull/103780. This PR is not in
the CSI driver repo yet, marking as <carry>. To be carried in OCP until the
EFS CSI driver upstream updates k8s.io/mount-utils v1.23
RomanBednar pushed a commit to RomanBednar/aws-efs-csi-driver that referenced this pull request Feb 7, 2023
…ng mount/umount

Based on github.com/kubernetes/kubernetes/pull/103780. This PR is not in
the CSI driver repo yet, marking as <carry>. To be carried in OCP until the
EFS CSI driver upstream updates k8s.io/mount-utils v1.23
jsafrane added a commit to jsafrane/aws-efs-csi-driver that referenced this pull request Mar 17, 2023
We want kubernetes/kubernetes#103780 in the EFS CSI
driver to fix https://bugzilla.redhat.com/show_bug.cgi?id=2056629

We need to carry it until upstream updates to k8s 1.23 or newer.

This replaces 7caacdb:
    UPSTREAM: <carry>: Ignore 'wait: no child processes' error when calling mount/umount
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/aws-efs-csi-driver that referenced this pull request Mar 17, 2023
We want kubernetes/kubernetes#103780 in the EFS CSI
driver to fix https://bugzilla.redhat.com/show_bug.cgi?id=2056629

We need to carry it until upstream updates to k8s 1.23 or newer.

This replaces 7caacdb:
    UPSTREAM: <carry>: Ignore 'wait: no child processes' error when calling mount/umount
tsmetana pushed a commit to tsmetana/aws-efs-csi-driver that referenced this pull request May 4, 2023
…ng mount/umount

Based on github.com/kubernetes/kubernetes/pull/103780. This PR is not in
the CSI driver repo yet, marking as <carry>. To be carried in OCP until the
EFS CSI driver upstream updates k8s.io/mount-utils v1.23
tsmetana pushed a commit to tsmetana/aws-efs-csi-driver that referenced this pull request May 4, 2023
We want kubernetes/kubernetes#103780 in the EFS CSI
driver to fix https://bugzilla.redhat.com/show_bug.cgi?id=2056629

We need to carry it until upstream updates to k8s 1.23 or newer.

This replaces 7caacdb:
    UPSTREAM: <carry>: Ignore 'wait: no child processes' error when calling mount/umount
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Volume mount/unmount errors saying "wait: no child processes"
7 participants