Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: revert Windows change to boot-time timers #35482

Open
networkimprov opened this issue Nov 9, 2019 · 27 comments

Comments

@networkimprov
Copy link

@networkimprov networkimprov commented Nov 9, 2019

An engineering lead on the Windows Base team (kernel, fs, etc) asked us to revert d85072, from #31528, because it changed Windows timers to advance during sleep; everywhere else Go has monotonic timers (see also #24595 #35012).

Quoting @jstarks from #35447 (comment)

The Windows kernel team changed timer behavior in Windows 8 to stop advancing relative timeouts on wake. Otherwise when you open your laptop lid, every timer in the system goes off all at once and you get a bunch of unpredictable errors. Software is generally written to assume that local processes will make forward progress over reasonable time periods, and if they don't then something is wrong. When the machine is asleep, this assumption is violated. By making relative timers behave like threads, so that they both run together or they both don't, the illusion is maintained. You can claim these programs are buggy, but they obviously exist. Watchdog timers are well-known constructs.

This was a conscious design decision in Windows, and so it's disappointing to see the Go runtime second guess this several years later in a bug fix.

@alexbrainman suggested an alternative approach to fixing the reported issue, via QueryUnbiasedInterruptTime() in #31528 (comment). Let's try to adopt that for 1.14.

We should backport that to 1.12 & 1.13, also reverting the commit which landed in 1.13.3, see #34130.

cc @ianlancetaylor @rsc @aclements @zx2c4 @jmontgomery-jc
@gopherbot add OS-Windows
@gopherbot add release-blocker

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 15, 2019

Is anybody working on the alternative approach mentioned above?

@alexbrainman

This comment has been minimized.

Copy link
Member

@alexbrainman alexbrainman commented Nov 15, 2019

Is anybody working on the alternative approach mentioned above?

No, I am not making this change.

@zx2c4 said in #31528 (comment)

It's impossible to implement WireGuard securely if timers don't take into account sleep time.

CL 191957 fixes this problem. If we revert the CL and replace it with QueryUnbiasedInterruptTime, the problem will reappear.

Also using QueryUnbiasedInterruptTime will make nanotime implementation about 2 times slower. See #31528 (comment) for details.

Alex

@jstarks

This comment has been minimized.

Copy link

@jstarks jstarks commented Nov 15, 2019

I can find some time to make the change that Alex originally prototyped if no one else is available. But in the meantime, a patch release of Go has regressed programs running in containers, so should we consider reverting this and then apply a more appropriate change later?

It's impossible to implement WireGuard securely if timers don't take into account sleep time.

CL 191957 fixes this problem. If we revert the CL and replace it with QueryUnbiasedInterruptTime, the problem will reappear.

As I understand it, Go on Linux has the same behavior as Go on Windows used to have before this change. WireGuard is specialized software and may have to work around this in both Windows and Linux.

I also think the fix in CL 191957 is incomplete--AFAIK, PowerRegisterSuspendResumeNotification only provides notifications for machines transitioning across classic sleep states, not when machines enter connected standby (which is used instead of sleep in some newer devices). In these cases, you will still see a difference between (biased) interrupt time and WaitForSingleObject's relative timers, so WireGuard presumably will still run into problems.

The right fix for WireGuard may be to offer a new kind of timer that uses absolute (wall clock) timeouts on Windows, which is affected by changes to system UTC time (NTP or otherwise) but not by sleep states. If that's insufficient, I can help investigate if there are other options that might be appropriate.

Also using QueryUnbiasedInterruptTime will make nanotime implementation about 2 times slower. See #31528 (comment) for details.

If slowing nanotime down from 2ns to 4ns is problematic, we can look at whether the internal definition of QueryUnbiasedInterruptTime is stable enough to inline into Go's runtime (which is what was done for the current version of nanotime: it was apparently cloned from QueryInterruptTime). I'd really like to avoid this, though, because the current definition relies on a private export from ntdll to the kernel that is not part of the external API or ABI.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Nov 15, 2019

This patch wouldn't have been accepted if the discussion about it had referenced #24595.

@zx2c4 said in #31528 (comment)

It's impossible to implement WireGuard securely if timers don't take into account sleep time.

Jason has stated that WireGuard requires timer patches for Linux, Darwin, and (prior to 1.13.3) Windows. Reverting this won't harm WG. I contacted him yesterday to point out this issue, and he ack'd, so I imagine he'll respond soon.

The right fix for WireGuard may be to offer a new kind of timer that uses absolute (wall clock) timeouts

I've been advocating for this on #24595 but so far, no traction...

@DmitriyMV

This comment has been minimized.

Copy link

@DmitriyMV DmitriyMV commented Nov 15, 2019

Quoting @mpx:

With BOOTTIME buggy use of timers may fail, with MONOTONIC correct use of timers may fail (Eg, #25248, #35012).

I don't think that trading correctness for the sake of backward compatibility is a right choice. I also don't think that adding new API to time package would be wise - most people don't care about the difference between MONOTONIC and BOOTTIME, and it will at best leave them confused, and at worst lead them to incorrect assumptions.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Nov 15, 2019

@DmitriyMV, that probably belongs in the thread you quoted; it's off-topic here.

@jstarks

This comment has been minimized.

Copy link

@jstarks jstarks commented Nov 15, 2019

@DmitriyMV , right now, with this change, we are inconsistent between Linux and Windows, and inconsistent between different Windows devices (connected standby vs. classic sleep). That inconsistency seems like the worst possible situation to be in.

@alexbrainman

This comment has been minimized.

Copy link
Member

@alexbrainman alexbrainman commented Nov 16, 2019

I can find some time to make the change that Alex originally prototyped if no one else is available. But in the meantime, a patch release of Go has regressed programs running in containers, so should we consider reverting this and then apply a more appropriate change later?

I have no opinion on this matter. I will let Ian decide what to do here.

Alex

@zx2c4

This comment has been minimized.

Copy link
Contributor

@zx2c4 zx2c4 commented Nov 21, 2019

Please close this issue and do not make any such revert.

The premises here are flawed.

  • Both Linux and macOS seem interested in moving to a BOOTTIME world. From discussion with the maintainers of the timer code in Linux, not doing this earlier is considered an unfortunate historical mishap. It'd be nice to change everything over in the kernel now all at once (which was tried some months ago), but compatibility issues make this a bit more tricky. So instead it's going to be a gradual shift in that direction.
  • MONOTONIC instead BOOTTIME makes implementing network protocols, such as WireGuard, impossible. Go must use BOOTTIME if we're going to have WireGuard or other similar network protocols.
  • From the original post, "The Windows kernel team changed timer behavior in Windows 8 to stop advancing relative timeouts on wake. Otherwise when you open your laptop lid, every timer in the system goes off all at once and you get a bunch of unpredictable errors." There are a lot of things wrong with this: firstly, I'm shocked that Microsoft broke compatibility -- they usually don't do that. Secondly, it's often the case that processes receive timer events "all at once" bunched up. On Unix, this happens all the time with SIGSTOP...SIGCONT, and Go code is generally fine. Heck, Go code better be fine. And, on highly loaded systems, scheduler latency often means this happens naturally.
  • From the original post, "This was a conscious design decision in Windows, and so it's disappointing to see the Go runtime second guess this several years later in a bug fix." Conscious or not, it's an annoying and bad decision, but not one that really matters to us. We're not "second guessing" - we're simply implementing the Go runtime as reasonably as we can given what the OS kernel provides.
@zx2c4

This comment has been minimized.

Copy link
Contributor

@zx2c4 zx2c4 commented Nov 21, 2019

@DmitriyMV , right now, with this change, we are inconsistent between Linux and Windows, and inconsistent between different Windows devices (connected standby vs. classic sleep). That inconsistency seems like the worst possible situation to be in.

Doing things the right way on Linux and other platforms is a work in progress. Feel happy for the rare case in which the Windows implementation achieves the correct implementation (using BOOTTIME) first.

@jstarks

This comment has been minimized.

Copy link

@jstarks jstarks commented Nov 21, 2019

This was a breaking change in a bug fix release that introduces inconsistent behavior between operating systems. There is clearly no consensus that this is the right change for Linux or the change would have been made already.

It's very strange to me that this is considered an acceptable approach to the evolution of the Go runtime.

@zx2c4

This comment has been minimized.

Copy link
Contributor

@zx2c4 zx2c4 commented Nov 21, 2019

More false claims. A lot to unpack in three sentences. Here we go:

There is clearly no consensus that this is the right change for Linux or the change would have been made already.

"Would have been made already" is a ridiculous conclusion to jump to. They tried it, but it broke some userland in unexpected ways. The timer maintainers want to do it. It's just a matter of figuring out how.

This was a breaking change in a bug fix release

No, it fixed a regression with Windows timer buckets. Before the fix, Go timers were not reliable following a compatibility-breaking OS change from Microsoft.

It also keeps WireGuard viable on Windows. Not taking into account sleep time makes WireGuard and other network protocols impossible to implement. It's possible your branch of Microsoft isn't interested in WireGuard, but I'm told some NDIS people are playing with it.

inconsistent behavior between operating systems.

As mentioned, it's an ongoing work in progress to bring BOOTTIME support to other operating systems.

that introduces

Clearly "introduces" is the wrong word, since Windows had always been like this, until Windows 8, at which time there were actually two timers semantics being used at the same time, causing problems. The bug fix moved things back to only using one set of timer semantics, fixing the problem.

And guess which set of timer semantics it chose in order to fix the problem? The one that had always been used on Windows in Go since the beginning. It didn't introduce a new one, as that could have caused problems. Instead it went back to providing the same timer semantics that Go had originally.

@jstarks

This comment has been minimized.

Copy link

@jstarks jstarks commented Nov 21, 2019

I'm going to double check my Windows 8 claim--it may have been Windows 7 (which would make sense because that's when QueryUnbiasedInterruptTime was introduced). I'll ask someone down the hall who has worked on the Windows timer infrastructure when I get a chance.

In any case, before the change to the Go runtime, Go programs inherited the system timer behavior. As far as I know, there was never a case before Go 1.13.3 that Go attempted to force a BOOTTIME-style timer behavior on Windows (or on Linux).

So yes, this change was a breaking change to Go timer semantics on Windows. I'm certainly sympathetic that WireGuard needs a solution here, but there is other Go software out there too.

@zx2c4

This comment has been minimized.

Copy link
Contributor

@zx2c4 zx2c4 commented Nov 21, 2019

I'm going to double check my Windows 8 claim--it may have been Windows 7

Your Windows 8 claim is correct. MSDN confirms this, as does my own testing. From https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitforsingleobject : "Windows XP, Windows Server 2003, Windows Vista, Windows 7, Windows Server 2008 and Windows Server 2008 R2: The dwMilliseconds value does include time spent in low-power states. For example, the timeout does keep counting down while the computer is asleep. Windows 8, Windows Server 2012, Windows 8.1, Windows Server 2012 R2, Windows 10 and Windows Server 2016: The dwMilliseconds value does not include time spent in low-power states. For example, the timeout does not keep counting down while the computer is asleep."

As far as I know, there was never a case before Go 1.13.3 that Go attempted to force a BOOTTIME-style timer behavior on Windows (or on Linux).

Before 1.13.3, Go relied on BOOTTIME-style timer behavior working, even though it did not on certain newer Windows platforms, and so Go was broken on those platforms and we got bug reports. The Go behavior on Windows has always been to rely on this BOOTTIME-style of behavior, but Windows 8 put us in an inconsistent state. 1.13.3 fixed the inconsistency by reverting to the semantics Windows users of Go have always relied on. What you're suggesting here is introducing totally new semantics that we've never relied on. That sounds like a new proposal, and something I emphatically n'ack.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Nov 21, 2019

I'm writing an app for multiple laptop platforms (Windows, MacOS, Linux) and want the same timer behavior across them. If the default behavior on Windows (or certain versions thereof) differs, I'd expect a switch to throw which aligns them.

That means a switch to let Windows 7 work like 8/10 and Unix, or vice versa. My current project could carry a runtime patch for this (I already need 2 stdlib patches on Windows).

I agree that Go should provide timers with boot-time semantics, but probably not as an upgrade to time.Timer/Ticker. How would apps that depend on timers reliably evaluate such an upgrade? What is the cost to adapt if the evaluation indicates trouble?

Go timers have been "broken" on Windows 8/10 for seven years and I've seen one bug report, filed in 2019; are there others? What did WireGuard do on Windows 8/10 before this patch? How does it handle timers on MacOS/Unix?

@dmitshur

This comment has been minimized.

Copy link
Member

@dmitshur dmitshur commented Nov 21, 2019

@zx2c4 Could you please elaborate on why you wanted to remove the release-blocker label?

We were going over release-blocker issues in a meeting, and because no one knew why it was removed, we thought it was a gopherbot bug and re-added it. We learned in #35755 that you requested it to be removed, but it wasn't visible to us at the time.

@zx2c4

This comment has been minimized.

Copy link
Contributor

@zx2c4 zx2c4 commented Nov 21, 2019

What did WireGuard do on Windows 8/10 before this patch?

Sleep was broken.

How does it handle timers on MacOS/Unix?

We patch Golang.

@zx2c4 Could you please elaborate on why you wanted to remove the release-blocker label?
We were going over release-blocker issues in a meeting, and because no one knew why it was removed, we thought it was a gopherbot bug and re-added it. We learned in #35755 that you requested it to be removed, but it wasn't visible to us at the time.

Oh, whoops, didn't think that'd be a big deal. The actual release-blocker bug is the docker issue somebody reported earlier -- #35447. Breaking docker seems very bad. This thread here, on the other hand, is some bikeshedding on if we should change Go's behavior from how it was originally designed way back when to something new and different. Not sure why this discussion would need to block a release.

@jstarks

This comment has been minimized.

Copy link

@jstarks jstarks commented Nov 21, 2019

I think it's disrespectful to brush this issue off as bikeshedding. I don't mind if you clear the release blocking tag (I didn't add it), but discussing the technical merits of a bug fix that you implemented is anything but bikeshedding.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Nov 21, 2019

So we have a likely near-term solution:
a) revert the patch in question,
b) provide a switch and/or fix to align Win7 with the other runtimes,
c) WireGuard can patch its runtime as it does for other platforms.

@zx2c4

This comment has been minimized.

Copy link
Contributor

@zx2c4 zx2c4 commented Nov 21, 2019

So we have a likely near-term solution:
a) revert the patch in question,
b) provide a switch and/or fix to align Win7 with the other runtimes,
c) WireGuard can patch its runtime as it does for other platforms.

That solution is unacceptable and will reintroduce the timerbucket stalling bug we had prior, reported by somebody else.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Nov 21, 2019

d) fix timerbucket stalling on Win8/10 with QueryUnbiasedInterruptTime(), per #31528 (comment)

EDIT: That API is available on Win7, so also applies to (b) above.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Dec 2, 2019

@jstarks could you amend or endorse this solution?

a) revert the patch in question,
b) fix Win7 to align with the other runtimes via QueryUnbiasedInterruptTime(),
c) fix timerbucket stalling on Win8/10 via QueryUnbiasedInterruptTime(), per #31528 (comment)

@zx2c4

This comment has been minimized.

Copy link
Contributor

@zx2c4 zx2c4 commented Dec 2, 2019

Microsoft asked us to revert d85072,

Just for the record, the "Microsoft" in question here is an employee who has asked that his mentions not be considered as speaking on behalf of Microsoft Corporation:

@jstarks writes in #34681 (comment) :

Also regardless I think "Microsoft requested" is a weird way to put things, since I am not the Go representative for Microsoft in any mutually agreed capacity (if such a position can or should even exist).

This was in response to @rsc aptly pointing out in #34681 (comment) :

I would be more persuaded by the "Microsoft requested that" argument for doing #32088 if the argument was coming from someone working on Windows, or maintaining the Go port to Windows, instead of someone whose bio says "Container+Linux guy at Microsoft".

I don't think that the this issue here has much technical merit, and I'd appreciate it being closed so that I don't have to continue living with the unstable prospect that the Go runtime will break my application.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Dec 2, 2019

Corrected above to read:
"An engineering lead on the Windows Base team (kernel, fs, etc) asked us to revert..."

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Dec 2, 2019

Hi. Sorry for the silence here. Just a heads up that I have a few other things at the top of my list right now, but then I plan to review all the history here and wade in.

@zx2c4

This comment has been minimized.

Copy link
Contributor

@zx2c4 zx2c4 commented Dec 4, 2019

Here's (yet) another project that will break if you move forward with what @networkimprov is advocating: https://github.com/mozilla-services/guardian-vpn-windows Please reject this proposal.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Dec 4, 2019

...because the project's 40 lines of Go imports Wireguard? (It does not directly import pkg time.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.