Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: remove 5ms sleep on Windows in (*Process).Wait #25965

Open
bradfitz opened this issue Jun 19, 2018 · 24 comments
Open

os: remove 5ms sleep on Windows in (*Process).Wait #25965

bradfitz opened this issue Jun 19, 2018 · 24 comments

Comments

@bradfitz
Copy link
Contributor

@bradfitz bradfitz commented Jun 19, 2018

In our Windows implementation of os/Process.Wait, we always do a 5 millisecond sleep:

func (p *Process) wait() (ps *ProcessState, err error) {
	handle := atomic.LoadUintptr(&p.handle)
	s, e := syscall.WaitForSingleObject(syscall.Handle(handle), syscall.INFINITE)
	switch s {
	case syscall.WAIT_OBJECT_0:
		break
	case syscall.WAIT_FAILED:
		return nil, NewSyscallError("WaitForSingleObject", e)
	default:
		return nil, errors.New("os: unexpected result from WaitForSingleObject")
	}
	var ec uint32
	e = syscall.GetExitCodeProcess(syscall.Handle(handle), &ec)
	if e != nil {
		return nil, NewSyscallError("GetExitCodeProcess", e)
	}
	var u syscall.Rusage
	e = syscall.GetProcessTimes(syscall.Handle(handle), &u.CreationTime, &u.ExitTime, &u.KernelTime, &u.UserTime)
	if e != nil {
		return nil, NewSyscallError("GetProcessTimes", e)
	}
	p.setDone()
	// NOTE(brainman): It seems that sometimes process is not dead
	// when WaitForSingleObject returns. But we do not know any
	// other way to wait for it. Sleeping for a while seems to do
	// the trick sometimes. So we will sleep and smell the roses.
	defer time.Sleep(5 * time.Millisecond)
	defer p.Release()
	return &ProcessState{p.Pid, syscall.WaitStatus{ExitCode: ec}, &u}, nil
}

Sleeps in code are always gross. They're either too fast (and still flaky) or too slow (incurring extra delays on everybody when not needed).

This bug is about figuring out what we're doing wrong on Windows that made us need that sleep in the first place.

https://go-review.googlesource.com/c/go/+/84175 was one attempt to remove it, but the goal then was fixing a Windows XP test failure (for #17245), and it didn't fix the XP failure. We no longer support XP, though, so maybe that CL is okay now.

But maybe it's not correct.

There was also https://golang.org/cl/84896 from @johnsonj to use jobs to wait for process completion on windows, but it was "not working as expected on Server 2008" and @alexbrainman had concerns in #17245 (comment) (that whole thread is worth a read, even if it's mostly about debugging an XP issue, which might've been related to this bug.)

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Jun 20, 2018

If you do not delete executable file immediately after its process exits, then you could remove that 5ms sleep, and nothing will change for you.

So we really need that extra sleep mainly to make cmd/go work properly, because cmd/go runs programs and deletes them immediately.

Perhaps we could move that 5ms sleep from os package and into cmd/go code.

Alex

@bradfitz
Copy link
Contributor Author

@bradfitz bradfitz commented Jun 20, 2018

It should be possible to delete a file after it finishes executing, for any program, not just cmd/go.

This bug is about figuring out what synchronization we're missing.

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Jun 23, 2018

This bug is about figuring out what synchronization we're missing.

I understand that. I was suggesting what to do in case we cannot find such synchronization.

Alex

@iamoryanmoshe
Copy link
Contributor

@iamoryanmoshe iamoryanmoshe commented Jul 21, 2018

Is the sleep needed on every process we want to wait on?
WinApi docs state that

... if you have a thread that creates windows, use MsgWaitForMultipleObjects or MsgWaitForMultipleObjectsEx, rather than WaitForSingleObject.

We might want to use a different wait method on windows?

@bradfitz
Copy link
Contributor Author

@bradfitz bradfitz commented Oct 25, 2018

When I tried to remove the sleep in https://go-review.googlesource.com/c/go/+/84175 I was doing it mostly for debugging Windows XP problems (#2866 and #17245), and it apparently didn't work.

But we don't support Windows XP anymore, so maybe we can remove it now?

But I forget everything about Windows I had learned while researching that patch.

/cc @alexbrainman @jordanrh1

@jordanrh1
Copy link
Contributor

@jordanrh1 jordanrh1 commented Oct 25, 2018

I've seen a lot of code that waits on a process handle, and none of it sleeps. It would be a major bug if you couldn't trust a process handle to be signaled at the right time, and would surely be fixed. The sleep must be masking a problem somewhere else. I'm all for removing the sleep and seeing what shakes loose.

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Oct 28, 2018

I'm all for removing the sleep and seeing what shakes loose.

SGTM

Alex

@gopherbot
Copy link

@gopherbot gopherbot commented Oct 28, 2018

Change https://golang.org/cl/145221 mentions this issue: os: remove sleep in windows Process.Wait

@gopherbot gopherbot closed this in 5c35973 Oct 29, 2018
@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Oct 31, 2018

@bradfitz @jordanrh1

CL 145221 have broken windows-2008 builder - latest snapshot of https://build.golang.org/

image

Even one failure on windows-2012 https://build.golang.org/log/d37c2bad6c49d20fe4bec6f1a8d380e583ea79e8

What should we do now?

Alex

@bradfitz
Copy link
Contributor Author

@bradfitz bradfitz commented Oct 31, 2018

Haven't we seen those errors in the past, though? I guess removing the 5ms sleep just makes them more likely?

Is there some other handle we're not closing quickly?

@jordanrh1, could you audit some of these code paths and make sure we're doing things right? (clearly we aren't)

@jordanrh1
Copy link
Contributor

@jordanrh1 jordanrh1 commented Oct 31, 2018

I'm not seeing anything obviously wrong with the code, but these are the things I'm suspecting:

  1. might be behavior of OS that it signals process handle before files are released
  2. might have to do with stdout/stderr handles
  3. might have to do with child processes
  4. might be a code defect in os.exec

To rule out (1), I want to see if I can repro the issue with a simple C program. If it is (1), then a Sleep may be the best solution, especially if Go's contract is that all files used by the process are released by the time Wait() returns.

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Nov 4, 2018

Haven't we seen those errors in the past, though? I guess removing the 5ms sleep just makes them more likely?

Yes, these errors happens more often. Maybe it was happening on Windows XP is even more often.

Is there some other handle we're not closing quickly?

I think that once process handle is signaled - syscall.WaitForSingleObject returns - the process is exited. So opened handles could not matter here. But I could be wrong about that.

I'm suspecting:

  1. might be behavior of OS that it signals process handle before files are released
  2. might have to do with stdout/stderr handles
  3. might have to do with child processes
  4. might be a code defect in os.exec

To rule out (1), I want to see if I can repro the issue with a simple C program. If it is (1), then a Sleep may be the best solution, especially if Go's contract is that all files used by the process are released by the time Wait() returns.

I do not see how to write similar C program for (1). The program that fails is cmd/go tests - how do you write C program that imitate cmd/go test?

I don't see how it could be related to (2). Like I said above, we wait for syscall.WaitForSingleObject on process handle to return, so whatever handles process opens, they should be all closed now. And when new process starts, we use syscall.DuplicateHandle to create stdin, stdout and stderr handles for child process and close the handles immediately - see syscall.StartProcess for details.

I am not sure what to say about (3) and (4).

Should I revert CL 145221 in the meantime?

Alex

@bradfitz
Copy link
Contributor Author

@bradfitz bradfitz commented Nov 9, 2018

Should I revert CL 145221 in the meantime?

Sure? Super sad, but we don't have good short-term leads.

If you do, please reopen this bug and reference both this bug in the commit, and #23171.

/cc @aclements

@aclements
Copy link
Member

@aclements aclements commented Nov 9, 2018

I do not see how to write similar C program for (1). The program that fails is cmd/go tests - how do you write C program that imitate cmd/go test?

How about creating an executable file (probably by just making a copy of an existing executable), run it, WaitForSingleObject, and then try to delete it, and do that in a loop? If the theory is that Windows signals the process handle before releasing the file, that should test it.

Is there any way to wait for locks on a file to be released? We can't just do that unconditionally here (there could be other copies of the binary running, too), but maybe we could make remove willing to block for a little while if the file is locked before failing?

@jordanrh1
Copy link
Contributor

@jordanrh1 jordanrh1 commented Nov 9, 2018

Yes, what @aclements suggested.

I should mention that I don't have any cycles to test this until the 19th.

@gopherbot
Copy link

@gopherbot gopherbot commented Nov 9, 2018

Change https://golang.org/cl/148957 mentions this issue: Revert "os: remove sleep in windows Process.Wait"

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Nov 9, 2018

reopen this bug and reference both this bug in the commit, and #23171.

https://go-review.googlesource.com/c/go/+/148957

How about creating an executable file (probably by just making a copy of an existing executable), run it, WaitForSingleObject, and then try to delete it, and do that in a loop?

I will try to do what you have asked. I would not be surprised, if the system configuration or OS version will also matter here - imagine, if you have an antivirus program eager to catch executable with virus in them, or another "helpful" service.

Is there any way to wait for locks on a file to be released? We can't just do that unconditionally here (there could be other copies of the binary running, too), but maybe we could make remove willing to block for a little while if the file is locked before failing?

I am not sure what locks you are talking about. Please, explain more.

All errors, we see here, happen during cmd/go test. Perhaps, the exe in questions starts itself as a child, and then parent process exists, and we notice parent exit, but do not wait for the child? But all examples of faults are simple run of a program - they do not look like starting any child processes.

Alex

@alexbrainman alexbrainman reopened this Nov 9, 2018
gopherbot pushed a commit that referenced this issue Nov 9, 2018
This reverts CL 145221 (commit 5c35973)

Reason for revert: breaks the build occasionally.

Updates #23171
Updates #25965

Change-Id: Ie1e3c76ab9bcd8d28b6118440b5f80c76f9b1852
Reviewed-on: https://go-review.googlesource.com/c/148957
Run-TryBot: Alex Brainman <alex.brainman@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
alexbrainman added a commit to alexbrainman/winapi that referenced this issue Apr 11, 2019
@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Apr 11, 2019

I will try to do what you have asked.

I created github.com/alexbrainman/winapi/tree/master/issue25965.TestIssue25965

The test runs required exercise for both Go version (only syscall package is used) and C version. The test fails immediately on my Windows 7 computer. But it was much harder to make it fail on my Windows 10 computer - I managed to make it fail after I added a loop of 100 iterations around the test - so be warned that the test might take a while to run. Anyway my Windows 10 output looks like this:

c:\>u:\test
PASS

c:\>u:\test
--- FAIL: TestIssue25965 (84.83s)
    main_test.go:166: running c exe failed: exit status 1 iteration 56: CopyFile failed (5)
FAIL

c:\>u:\test
--- FAIL: TestIssue25965 (92.83s)
    main_test.go:143: iteration 83 failed: CopyFile: Access is denied.
FAIL

c:\>u:\test
--- FAIL: TestIssue25965 (64.27s)
    main_test.go:166: running c exe failed: exit status 1 iteration 14: CopyFile failed (5)
FAIL

c:\>

Let me know, if you want me to do anything else.

Alex

@maruel
Copy link
Contributor

@maruel maruel commented May 24, 2019

Keep in mind that users with a poorly implemented anti virus installed may get the issue "executable binary is locked a bit after the process exited" anyway, independent on how solid you do the implementation here.

I haven't tested but I think the right solution (if it works) would be:

  • Create the executable with FILE_FLAG_DELETE_ON_CLOSE, do not close the handle.
  • Launch the program.
  • Wait for the program to quit.
  • Close the handle.
    When all handles are closed, the file would be deleted. That said, the file deletion is still asynchronous ¯\_(ツ)_/¯. If you want to wait for the file to be effectively deleted, you'd have to wait for it and that's more work.
@maruel
Copy link
Contributor

@maruel maruel commented Jun 3, 2019

My proposal assumes FILE_SHARE_DELETE is usable, as requested in #32088.

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Jun 4, 2019

Keep in mind that users with a poorly implemented anti virus installed may get the issue "executable binary is locked a bit after the process exited" anyway, independent on how solid you do the implementation here.

I agree.

I haven't tested but I think the right solution (if it works) would be:

I wasted enough time here. If you want to try and implement this, go ahead. We will review your code.

Alex

@maruel
Copy link
Contributor

@maruel maruel commented Jun 4, 2019

Thanks, anyhow I'd prefer this to be slated on go1.14 since it's a significant behavior change that could have subtle behavior impact so I'll plan to work on this later.

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Jun 6, 2019

I'd prefer this to be slated on go1.14 since it's a significant behavior change that could have subtle behavior impact so I'll plan to work on this later.

SGTM.

Alex

@gopherbot
Copy link

@gopherbot gopherbot commented Jun 10, 2019

Change https://golang.org/cl/181541 mentions this issue: cmd/go: factor the I/O-retry logic out of renameio

gopherbot pushed a commit that referenced this issue Jun 13, 2019
Factor the try-on-failure variants are now in the package
cmd/go/internal/robustio.

Add to them a RemoveAll variant using the same retry loop,
and use it to attempt to address the observed flakes in
TestLinkXImportPathEscape.

Fixes #19491
Updates #25965
Updates #28387
Updates #32188

Change-Id: I9db1a0c7537b8aaadccab1b9eca734595668ba29
Reviewed-on: https://go-review.googlesource.com/c/go/+/181541
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
Reviewed-by: Jay Conrod <jayconrod@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.