New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: remove 5ms sleep on Windows in (*Process).Wait #25965

Open
bradfitz opened this Issue Jun 19, 2018 · 17 comments

Comments

Projects
None yet
6 participants
@bradfitz
Member

bradfitz commented Jun 19, 2018

In our Windows implementation of os/Process.Wait, we always do a 5 millisecond sleep:

func (p *Process) wait() (ps *ProcessState, err error) {
	handle := atomic.LoadUintptr(&p.handle)
	s, e := syscall.WaitForSingleObject(syscall.Handle(handle), syscall.INFINITE)
	switch s {
	case syscall.WAIT_OBJECT_0:
		break
	case syscall.WAIT_FAILED:
		return nil, NewSyscallError("WaitForSingleObject", e)
	default:
		return nil, errors.New("os: unexpected result from WaitForSingleObject")
	}
	var ec uint32
	e = syscall.GetExitCodeProcess(syscall.Handle(handle), &ec)
	if e != nil {
		return nil, NewSyscallError("GetExitCodeProcess", e)
	}
	var u syscall.Rusage
	e = syscall.GetProcessTimes(syscall.Handle(handle), &u.CreationTime, &u.ExitTime, &u.KernelTime, &u.UserTime)
	if e != nil {
		return nil, NewSyscallError("GetProcessTimes", e)
	}
	p.setDone()
	// NOTE(brainman): It seems that sometimes process is not dead
	// when WaitForSingleObject returns. But we do not know any
	// other way to wait for it. Sleeping for a while seems to do
	// the trick sometimes. So we will sleep and smell the roses.
	defer time.Sleep(5 * time.Millisecond)
	defer p.Release()
	return &ProcessState{p.Pid, syscall.WaitStatus{ExitCode: ec}, &u}, nil
}

Sleeps in code are always gross. They're either too fast (and still flaky) or too slow (incurring extra delays on everybody when not needed).

This bug is about figuring out what we're doing wrong on Windows that made us need that sleep in the first place.

https://go-review.googlesource.com/c/go/+/84175 was one attempt to remove it, but the goal then was fixing a Windows XP test failure (for #17245), and it didn't fix the XP failure. We no longer support XP, though, so maybe that CL is okay now.

But maybe it's not correct.

There was also https://golang.org/cl/84896 from @johnsonj to use jobs to wait for process completion on windows, but it was "not working as expected on Server 2008" and @alexbrainman had concerns in #17245 (comment) (that whole thread is worth a read, even if it's mostly about debugging an XP issue, which might've been related to this bug.)

@alexbrainman

This comment has been minimized.

Member

alexbrainman commented Jun 20, 2018

If you do not delete executable file immediately after its process exits, then you could remove that 5ms sleep, and nothing will change for you.

So we really need that extra sleep mainly to make cmd/go work properly, because cmd/go runs programs and deletes them immediately.

Perhaps we could move that 5ms sleep from os package and into cmd/go code.

Alex

@bradfitz

This comment has been minimized.

Member

bradfitz commented Jun 20, 2018

It should be possible to delete a file after it finishes executing, for any program, not just cmd/go.

This bug is about figuring out what synchronization we're missing.

@alexbrainman

This comment has been minimized.

Member

alexbrainman commented Jun 23, 2018

This bug is about figuring out what synchronization we're missing.

I understand that. I was suggesting what to do in case we cannot find such synchronization.

Alex

@iamoryanmoshe

This comment has been minimized.

Contributor

iamoryanmoshe commented Jul 21, 2018

Is the sleep needed on every process we want to wait on?
WinApi docs state that

... if you have a thread that creates windows, use MsgWaitForMultipleObjects or MsgWaitForMultipleObjectsEx, rather than WaitForSingleObject.

We might want to use a different wait method on windows?

@bradfitz

This comment has been minimized.

Member

bradfitz commented Oct 25, 2018

When I tried to remove the sleep in https://go-review.googlesource.com/c/go/+/84175 I was doing it mostly for debugging Windows XP problems (#2866 and #17245), and it apparently didn't work.

But we don't support Windows XP anymore, so maybe we can remove it now?

But I forget everything about Windows I had learned while researching that patch.

/cc @alexbrainman @jordanrh1

@jordanrh1

This comment has been minimized.

Contributor

jordanrh1 commented Oct 25, 2018

I've seen a lot of code that waits on a process handle, and none of it sleeps. It would be a major bug if you couldn't trust a process handle to be signaled at the right time, and would surely be fixed. The sleep must be masking a problem somewhere else. I'm all for removing the sleep and seeing what shakes loose.

@alexbrainman

This comment has been minimized.

Member

alexbrainman commented Oct 28, 2018

I'm all for removing the sleep and seeing what shakes loose.

SGTM

Alex

@gopherbot

This comment has been minimized.

gopherbot commented Oct 28, 2018

Change https://golang.org/cl/145221 mentions this issue: os: remove sleep in windows Process.Wait

@gopherbot gopherbot closed this in 5c35973 Oct 29, 2018

@alexbrainman

This comment has been minimized.

Member

alexbrainman commented Oct 31, 2018

@bradfitz @jordanrh1

CL 145221 have broken windows-2008 builder - latest snapshot of https://build.golang.org/

image

Even one failure on windows-2012 https://build.golang.org/log/d37c2bad6c49d20fe4bec6f1a8d380e583ea79e8

What should we do now?

Alex

@bradfitz

This comment has been minimized.

Member

bradfitz commented Oct 31, 2018

Haven't we seen those errors in the past, though? I guess removing the 5ms sleep just makes them more likely?

Is there some other handle we're not closing quickly?

@jordanrh1, could you audit some of these code paths and make sure we're doing things right? (clearly we aren't)

@jordanrh1

This comment has been minimized.

Contributor

jordanrh1 commented Oct 31, 2018

I'm not seeing anything obviously wrong with the code, but these are the things I'm suspecting:

  1. might be behavior of OS that it signals process handle before files are released
  2. might have to do with stdout/stderr handles
  3. might have to do with child processes
  4. might be a code defect in os.exec

To rule out (1), I want to see if I can repro the issue with a simple C program. If it is (1), then a Sleep may be the best solution, especially if Go's contract is that all files used by the process are released by the time Wait() returns.

@alexbrainman

This comment has been minimized.

Member

alexbrainman commented Nov 4, 2018

Haven't we seen those errors in the past, though? I guess removing the 5ms sleep just makes them more likely?

Yes, these errors happens more often. Maybe it was happening on Windows XP is even more often.

Is there some other handle we're not closing quickly?

I think that once process handle is signaled - syscall.WaitForSingleObject returns - the process is exited. So opened handles could not matter here. But I could be wrong about that.

I'm suspecting:

  1. might be behavior of OS that it signals process handle before files are released
  2. might have to do with stdout/stderr handles
  3. might have to do with child processes
  4. might be a code defect in os.exec

To rule out (1), I want to see if I can repro the issue with a simple C program. If it is (1), then a Sleep may be the best solution, especially if Go's contract is that all files used by the process are released by the time Wait() returns.

I do not see how to write similar C program for (1). The program that fails is cmd/go tests - how do you write C program that imitate cmd/go test?

I don't see how it could be related to (2). Like I said above, we wait for syscall.WaitForSingleObject on process handle to return, so whatever handles process opens, they should be all closed now. And when new process starts, we use syscall.DuplicateHandle to create stdin, stdout and stderr handles for child process and close the handles immediately - see syscall.StartProcess for details.

I am not sure what to say about (3) and (4).

Should I revert CL 145221 in the meantime?

Alex

@bradfitz

This comment has been minimized.

Member

bradfitz commented Nov 9, 2018

Should I revert CL 145221 in the meantime?

Sure? Super sad, but we don't have good short-term leads.

If you do, please reopen this bug and reference both this bug in the commit, and #23171.

/cc @aclements

@aclements

This comment has been minimized.

Member

aclements commented Nov 9, 2018

I do not see how to write similar C program for (1). The program that fails is cmd/go tests - how do you write C program that imitate cmd/go test?

How about creating an executable file (probably by just making a copy of an existing executable), run it, WaitForSingleObject, and then try to delete it, and do that in a loop? If the theory is that Windows signals the process handle before releasing the file, that should test it.

Is there any way to wait for locks on a file to be released? We can't just do that unconditionally here (there could be other copies of the binary running, too), but maybe we could make remove willing to block for a little while if the file is locked before failing?

@jordanrh1

This comment has been minimized.

Contributor

jordanrh1 commented Nov 9, 2018

Yes, what @aclements suggested.

I should mention that I don't have any cycles to test this until the 19th.

@gopherbot

This comment has been minimized.

gopherbot commented Nov 9, 2018

Change https://golang.org/cl/148957 mentions this issue: Revert "os: remove sleep in windows Process.Wait"

@alexbrainman

This comment has been minimized.

Member

alexbrainman commented Nov 9, 2018

reopen this bug and reference both this bug in the commit, and #23171.

https://go-review.googlesource.com/c/go/+/148957

How about creating an executable file (probably by just making a copy of an existing executable), run it, WaitForSingleObject, and then try to delete it, and do that in a loop?

I will try to do what you have asked. I would not be surprised, if the system configuration or OS version will also matter here - imagine, if you have an antivirus program eager to catch executable with virus in them, or another "helpful" service.

Is there any way to wait for locks on a file to be released? We can't just do that unconditionally here (there could be other copies of the binary running, too), but maybe we could make remove willing to block for a little while if the file is locked before failing?

I am not sure what locks you are talking about. Please, explain more.

All errors, we see here, happen during cmd/go test. Perhaps, the exe in questions starts itself as a child, and then parent process exists, and we notice parent exit, but do not wait for the child? But all examples of faults are simple run of a program - they do not look like starting any child processes.

Alex

@alexbrainman alexbrainman reopened this Nov 9, 2018

gopherbot pushed a commit that referenced this issue Nov 9, 2018

Revert "os: remove sleep in windows Process.Wait"
This reverts CL 145221 (commit 5c35973)

Reason for revert: breaks the build occasionally.

Updates #23171
Updates #25965

Change-Id: Ie1e3c76ab9bcd8d28b6118440b5f80c76f9b1852
Reviewed-on: https://go-review.googlesource.com/c/148957
Run-TryBot: Alex Brainman <alex.brainman@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment