Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: remove the mlock hack in Go 1.15 #40184

Closed
DanielShaulov opened this issue Jul 13, 2020 · 23 comments
Closed

runtime: remove the mlock hack in Go 1.15 #40184

DanielShaulov opened this issue Jul 13, 2020 · 23 comments
Labels
Milestone

Comments

@DanielShaulov
Copy link

@DanielShaulov DanielShaulov commented Jul 13, 2020

The resolution to #35777 was to mlock the top of stack on "affected" kernels. But then it was discovered that major distros use kernels that appear by the version number to be affected but are actually patched (#37436). It was also discovered that docker under systemd has a very low limit on mlocked pages (same issue). The resolution for that was to stop reporting failures of mlock and delay the warning until a crash is caught.

There is a TODO comment in the code to remove all of that hack for Go 1.15, since unpatched kernels are unlikely to be encountered, but Go 1.15 is due to be released in a short time and the code is still there.

I think it is important to remove the workaround since Ubuntu 20.04 LTS uses a patched 5.4.0 kernel. This means that any user on Ubuntu 20.04 will still unnecessarily mlock pages, and if he runs in a docker container, that warning will be displayed for every crash, disregarding the fact that his kernel is not really buggy. So those users might be sent on a wild goose chase trying to understand and read all this info, and it will have nothing to do with their bug, probably for the entirety of Ubuntu 20.04 life cycle.

Relevant code is in src/runtime/os_linux_x86.go

@ALTree
Copy link
Member

@ALTree ALTree commented Jul 13, 2020

There is a TODO comment in the code to remove all of that hack for Go 1.15 [...] but Go 1.15 is due to be released in a short time and the code is still there.

Can you link where this TODO is? The only relevant comment I found was on CL 223121, in the runtime/panic.go file, but it says:

// TODO(austin): Remove this after Go 1.15 when we remove the
// mlockGsignal workaround

and "after Go 1.15" means for Go 1.16.

@ALTree ALTree changed the title Remove the mlock hack in Go 1.15 runtime: remove the mlock hack in Go 1.15 Jul 13, 2020
@randall77
Copy link
Contributor

@randall77 randall77 commented Jul 13, 2020

@nemith
Copy link
Contributor

@nemith nemith commented Jul 13, 2020

Since the kernel detection is done based off of major/minor numbers we actually have to apply a patch to internal builds of the compiler to include our own kernel builds which are not affected but fall in the affected range. I would love for this to be removed or at least optionally bypassed

@networkimprov
Copy link

@networkimprov networkimprov commented Jul 13, 2020

I don't think this needs further info from the author to be actionable...

@gopherbot remove WaitingForInfo

@aclements
Copy link
Member

@aclements aclements commented Jul 13, 2020

@nemith , can you explain why you need to patch the compiler? As of Go 1.14.1, the only effect of an mlock failure is an additional message if there's an uncaught panic, so I don't understand why that requires patching.

@aclements aclements added this to the Go1.15 milestone Jul 13, 2020
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 13, 2020

If there is a way to detect the patched kernel, perhaps by reading /proc/version, I think we could implement that safely enough in 1.15.

@networkimprov
Copy link

@networkimprov networkimprov commented Jul 13, 2020

Re kernel version sources, see this and comments immediately preceding it: #37436 (comment)

@aclements
Copy link
Member

@aclements aclements commented Jul 13, 2020

There's one deciding factor here: is the kernel patch widespread enough that the workaround is no longer necessary? I'm sure all the distros have it, but people aren't necessarily on top of upgrading their kernels. Unfortunately, I have no idea how to figure out the answer to this, and I think we should be fairly conservative about this given the nature of the kernel bug.

@lootch
Copy link

@lootch lootch commented Jul 14, 2020

@networkimprov
Copy link

@networkimprov networkimprov commented Jul 14, 2020

A compromise solution: omit the mlock workaround for a few major distros when the version is known to include the kernel fix.

More distros could be added in later 1.15.1+ releases by folks who can test them.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 20, 2020

@DanielShaulov Would it address your problem if we provided an environment variable to disable the use of mlock, as in GODEBUG=mlock=0?

@gopherbot
Copy link

@gopherbot gopherbot commented Jul 20, 2020

Change https://golang.org/cl/243658 mentions this issue: runtime: let GODEBUG=mlock=0 disable mlock calls

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 20, 2020

CL 243658 adds GODEBUG=mlock=0 to disable the use of mlock. If that fixes the problem, we can commit that for 1.15 (and perhaps 1.14.7) and then repurpose this issue for removing this code for 1.16.

@DanielShaulov
Copy link
Author

@DanielShaulov DanielShaulov commented Jul 20, 2020

Hi, I am very sorry for not responding, didn't get notifications until I got pinged.

@ALTree Here is the relevant TODO, it says "in Go 1.15" and not "after Go 1.15":

// TODO(austin): Remove this in Go 1.15, at which point it
// will be unlikely to encounter any of the affected kernels
// in the wild.

But I see that the other places used "after Go 1.15", so this one might have been a mistake?

@networkimprov At the very least - all the 5.4.0 kernels that Ubuntu ever published were patched (I checked all the 5.4.0-* tags in their git). So I think just checking for that will cover a pretty large set of false positive.
Redhat family can be checked to see if they have an LTS release stuck with a kernel version in the "bad" range, but if not then their problem is much smaller than Ubuntu's.

@ianlancetaylor See above about the confusion about the version it was supposed to be fixed. I am OK with re-purposing the issue for 1.16, I was just worried that this was just forgotten and be left like this for the whole duration of Ubuntu 20.04.
But I still think adding another check for the major distros as networkimprov suggested is a good idea.
I don't think the DEBUG=mlock=0 thing would really solve the issue. Developers will still spend time looking into those issue, reading the wiki, running the test program, just to find out that their kernel is already patched, and their issue lies elsewhere.
But the resolution for this can wait for Go 1.16 I guess.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 20, 2020

@DanielShaulov Thanks. Since the GODEBUG environment variable won't help, I'll drop that CL.

I'm not sure how to proceed with checking major distros. What we need is some mapping from the /proc/version string to kernels that are known to be fixed.

@DanielShaulov
Copy link
Author

@DanielShaulov DanielShaulov commented Jul 21, 2020

Well - first we need to get our distro name, for that I think /etc/os-release is the most reliable way. We just need the NAME= field. (And if we fail that, no big deal, we just fall back to current behavior of just checking official kernel release).

Then, looking at #37436 (comment) we can probably just parse the end of the kernel release (the part after - which is currently discarded in parseRelease) and save it as tail,
And then in osArchInit, we can just switch on the distro name an have our per-distro logic.

For Ubuntu, it is a dead simple if major == 5 && minor >= 4 { return }, and doesn't even need the tail, since all 5.4.x kernels in Ubuntu are patched. (other distros might need the tail)

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 22, 2020

https://golang.org/cl/244059 is code that skips the mlock on Ubuntu 5.4 systems, based on reading /proc/version.

@gopherbot
Copy link

@gopherbot gopherbot commented Jul 22, 2020

Change https://golang.org/cl/244059 mentions this issue: runtime: don't mlock on Ubuntu 5.4 systems

@gopherbot gopherbot closed this in 71218db Jul 22, 2020
@networkimprov
Copy link

@networkimprov networkimprov commented Jul 22, 2020

@ianlancetaylor you could leave this issue open for discussion of other distros during the 1.16 cycle.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 22, 2020

In the 1.16 cycle we're going to remove the mlock calls (#35979).

@networkimprov
Copy link

@networkimprov networkimprov commented Jul 22, 2020

But you may add other distros to the don't-mlock list in 1.15.x.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 22, 2020

@networkimprov True, but I don't see that as a reason to keep this issue open. There is no action that we would take to close the issue.

@gopherbot
Copy link

@gopherbot gopherbot commented Jul 31, 2020

Change https://golang.org/cl/246200 mentions this issue: runtime: revert signal stack mlocking

gopherbot pushed a commit that referenced this issue Aug 13, 2020
Go 1.14 included a (rather awful) workaround for a Linux kernel bug
that corrupted vector registers on x86 CPUs during signal delivery
(https://bugzilla.kernel.org/show_bug.cgi?id=205663). This bug was
introduced in Linux 5.2 and fixed in 5.3.15, 5.4.2 and all 5.5 and
later kernels. The fix was also back-ported by major distros. This
workaround was necessary, but had unfortunate downsides, including
causing Go programs to exceed the mlock ulimit in many configurations
(#37436).

We're reasonably confident that by the Go 1.16 release, the number of
systems running affected kernels will be vanishingly small. Hence,
this CL removes this workaround.

This effectively reverts CLs 209597 (version parser), 209899 (mlock
top of signal stack), 210299 (better failure message), 223121 (soft
mlock failure handling), and 244059 (special-case patched Ubuntu
kernels). The one thing we keep is the osArchInit function. It's empty
everywhere now, but is a reasonable hook to have.

Updates #35326, #35777 (the original register corruption bugs).
Updates #40184 (request to revert in 1.15).
Fixes #35979.

Change-Id: Ie213270837095576f1f3ef46bf3de187dc486c50
Reviewed-on: https://go-review.googlesource.com/c/go/+/246200
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
9 participants
You can’t perform that action at this time.