Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: AVX512 register state clobbered by signal on macOS #49233

Open
vsivsi opened this issue Oct 29, 2021 · 36 comments
Open

runtime: AVX512 register state clobbered by signal on macOS #49233

vsivsi opened this issue Oct 29, 2021 · 36 comments

Comments

@vsivsi
Copy link

@vsivsi vsivsi commented Oct 29, 2021

What version of Go are you using (go version)?

% go version
go version go1.17.2 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
 % go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/vsi/Library/Caches/go-build"
GOENV="/Users/vsi/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/vsi/go/pkg/mod"
GONOPROXY="github.com/vsivsi"
GONOSUMDB="github.com/vsivsi"
GOOS="darwin"
GOPATH="/Users/vsi/go"
GOPRIVATE="github.com/vsivsi"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.17.2/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.17.2/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.17.2"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/kp/kjdr0ytx5z9djnq4ysl15x0h0000gn/T/go-build4285878374=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

In writing AVX-512 assembly functions using the Go assembler (via Avo), I began to notice unexplained intermittent test failures. A lot of painful investigation revealed that golang async preemption does not seem to be properly save/restoring AVX512 state when assembly functions are interrupted. From visual inspection of the runtime/preempt_amd64.s source, it also seems likely that AVX/AVX2 state (YMM upper 128 bits) may not be properly save/restored either, but I have not encountered/tested that case in my own code.

A complete minimal reproduction illustrating the clobbering of the AVX512 K1 opmask register is here:
https://gist.github.com/vsivsi/fff8618ace4b02eb410dd8792779bf32

Note, in my testing running with GODEBUG=asyncpreemptoff=1 rescues every failure case I have identified.

What did you expect to see?

I expected all relevant processor state to be properly restored in an assembly function following an async preemption.

What did you see instead?

Critical state is being clobbered, leading to intermittent undefined behavior.

@vsivsi vsivsi changed the title AVX512 Opmask (K0-K7) registers clobbered by async preemption Async preemption clobbers AVX512 register state Oct 29, 2021
@dmitshur
Copy link
Contributor

@dmitshur dmitshur commented Oct 29, 2021

Loading

@dmitshur dmitshur changed the title Async preemption clobbers AVX512 register state runtime: async preemption clobbers AVX512 register state Oct 29, 2021
@dmitshur dmitshur added this to the Backlog milestone Oct 29, 2021
@randall77
Copy link
Contributor

@randall77 randall77 commented Oct 29, 2021

Hmm, this may be the VZEROUPPER we introduced in CL 219131 to fix #37174. Or it may be worse than that, I'm not sure.
@cherrymui

@vsivsi Can you try patching out #ifdef GOOS_darwin clause in runtime/preempt_amd64.s and see if that helps?

Loading

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Oct 29, 2021

The runtime doesn't async preempt assembly functions. asyncPreempt (the function in runtime/preempt_amd64.s) should not be executed if it is running in an assembly function. The runtime sends a signal, but if the signal lands in an assembly function the signal handler will return immediately without actually preempt.

I suspect that the darwin kernel doesn't preserve AVX512 state when handling signals. Then the problem would not be preemption, but any asynchronous signal could cause it to fail, including profiling signals or user signals.

What version of macOS are you running on?

I have a vague memory that Apple suggests that before using AVX512 one must check its availability with some sysctl (but I'm not really sure). Do we do that?

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Oct 30, 2021

@randall77 I can try that, but the initial issue I encountered (as reflected in the repro case) was clobbering of the K0-K7 opmask registers, which are distinct from the vector registers affected by VZEROUPPER, etc.

@cherrymui Yes, Darwin requires special checking to identify AVX512 support. I submitted a fix for this last year. See here for details on that saga: #43089

It is quite possible that this issue I'm seeing is new with golang 1.17. Is it possible that the new assembly adapter funcs required for inter-operation with the new function parameter passing spec could be interfering with the assembly code detection during preemption?

To maintain compatibility with existing assembly functions, the compiler generates adapter functions that convert between the new register-based calling convention and the previous stack-based calling convention. These adapters are typically invisible to users, except that taking the address of a Go function in assembly code or taking the address of an assembly function in Go code using reflect.ValueOf(fn).Pointer() or unsafe.Pointer will now return the address of the adapter. Code that depends on the value of these code pointers may no longer behave as expected. Adapters also may cause a very small performance overhead in two cases: calling an assembly function indirectly from Go via a func value, and calling Go functions from assembly.

Loading

@josharian
Copy link
Contributor

@josharian josharian commented Oct 30, 2021

How we detect asm routines during preempt changed in Go 1.18. (c2483a5#diff-47fd68949147d260b91998c0d6eabffcdb74c58991fc386b40e14a6ed710b17d)

Does this reproduce at tip? (I'm AFK.)

Loading

@leitzler
Copy link
Contributor

@leitzler leitzler commented Oct 30, 2021

It does, with:

% go version
go version devel go1.18-6113dacf32 Sat Oct 30 18:30:34 2021 +0000 darwin/amd64
% sysctl machdep.cpu.brand_string
machdep.cpu.brand_string: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
% sw_vers
ProductName:	macOS
ProductVersion:	11.6
BuildVersion:	20G165
% go test .
--- FAIL: TestMask (0.43s)
    maskcheck_test.go:16: Failed for iteration: 0 with return value: 88007197
    maskcheck_test.go:16: Failed for iteration: 1 with return value: 61939268
    maskcheck_test.go:16: Failed for iteration: 2 with return value: 50093958
    maskcheck_test.go:16: Failed for iteration: 3 with return value: 57687622
    maskcheck_test.go:16: Failed for iteration: 4 with return value: 55302035
    maskcheck_test.go:16: Failed for iteration: 5 with return value: 50302791
    maskcheck_test.go:16: Failed for iteration: 6 with return value: 55661062
    maskcheck_test.go:16: Failed for iteration: 7 with return value: 54177345
    maskcheck_test.go:16: Failed for iteration: 8 with return value: 58187395
    maskcheck_test.go:16: Failed for iteration: 9 with return value: 51991146
    maskcheck_test.go:16: Failed for iteration: 10 with return value: 56438664
    maskcheck_test.go:16: Failed for iteration: 11 with return value: 54676159
    maskcheck_test.go:16: Failed for iteration: 12 with return value: 50955817
    maskcheck_test.go:16: Failed for iteration: 13 with return value: 52912278
    maskcheck_test.go:16: Failed for iteration: 14 with return value: 57776646
    maskcheck_test.go:16: Failed for iteration: 15 with return value: 54628233
    maskcheck_test.go:16: Failed for iteration: 16 with return value: 55309645
    maskcheck_test.go:16: Failed for iteration: 17 with return value: 52719044
    maskcheck_test.go:16: Failed for iteration: 18 with return value: 57165517
    maskcheck_test.go:16: Failed for iteration: 19 with return value: 57336883
FAIL
FAIL	x	0.570s
FAIL

Loading

klauspost added a commit to klauspost/minio that referenced this issue Nov 1, 2021
Preemptively disable AVX512 until golang/go#49233 has been resolved.

This potentially affects reedsolomon, simdjson, sha256-simd, md5-simd packages.

Init order requires a separate package since main itself is initialized last, but imports are initialized in the order they are imported from main (confirmed).
harshavardhana pushed a commit to minio/minio that referenced this issue Nov 1, 2021
Preemptively disable AVX512 until golang/go#49233 has been resolved.

This potentially affects reedsolomon, simdjson, sha256-simd, md5-simd packages.

Init order requires a separate package since main itself is initialized last, but imports are initialized in the order they are imported from main (confirmed).
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 1, 2021

It is quite possible that this issue I'm seeing is new with golang 1.17. Is it possible that the new assembly adapter funcs required for inter-operation with the new function parameter passing spec could be interfering with the assembly code detection during preemption?

It seems unlikely. You can try building with Go 1.17 with GOEXPERIMENT=noregabi environment variable set (at go run/build/test time).

Also, you can try running the program with preemption disabled but CPU profiling enabled and see if profiling signal changes anything. You can also try running the program with preemption disabled while having a separate process sending SIGURG signals to it. Thanks.

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 2, 2021

@cherrymui Good suggestions. Running the repro test case as:

 % GODEBUG=asyncpreemptoff=1 go test -cpuprofile cpu.out -count 1 -timeout 15m -run '^TestMask$' gist.github.com/vsivsi/fff8618ace4b02eb410dd8792779bf32
--- FAIL: TestMask (0.26s)
    maskcheck_test.go:21: Failed for iteration: 0 with return value: 47429074
    maskcheck_test.go:21: Failed for iteration: 1 with return value: 45358536
    maskcheck_test.go:21: Failed for iteration: 2 with return value: 52675366
    maskcheck_test.go:21: Failed for iteration: 3 with return value: 44456167
    maskcheck_test.go:21: Failed for iteration: 4 with return value: 72420817
    maskcheck_test.go:21: Failed for iteration: 5 with return value: 72654067
    maskcheck_test.go:21: Failed for iteration: 6 with return value: 72091434
    maskcheck_test.go:21: Failed for iteration: 7 with return value: 73369163
    maskcheck_test.go:21: Failed for iteration: 8 with return value: 72828760
    maskcheck_test.go:21: Failed for iteration: 9 with return value: 73065114
    maskcheck_test.go:21: Failed for iteration: 10 with return value: 73038149
    maskcheck_test.go:21: Failed for iteration: 11 with return value: 72684990
    maskcheck_test.go:21: Failed for iteration: 12 with return value: 72653599
    maskcheck_test.go:21: Failed for iteration: 13 with return value: 72440260
    maskcheck_test.go:21: Failed for iteration: 14 with return value: 72201890
    maskcheck_test.go:21: Failed for iteration: 15 with return value: 45021003
    maskcheck_test.go:21: Failed for iteration: 16 with return value: 72012922
    maskcheck_test.go:21: Failed for iteration: 17 with return value: 44692569
    maskcheck_test.go:21: Failed for iteration: 18 with return value: 72992403
    maskcheck_test.go:21: Failed for iteration: 19 with return value: 73178568
FAIL
FAIL	gist.github.com/vsivsi/fff8618ace4b02eb410dd8792779bf32	0.601s
FAIL

This leads to the test failing, and with different looking stats (each iteration fails earlier in the loop on average, probably due to different signaling rates between profiling and preemption).

Two other data points:

  1. I have not yet observed corruption of actual ZMM vector state in my testing. From what I've seen this appears to be limited to the K0-K7 opmask registers.
  2. The above observation is curious in light of the fact that the Darwin kernel code I've been looking at is at least not obviously broken w.r.t to handling the opmask registers and AVX512 thread state in signal handling. Opmask data is present in all of the relevant state structures, there appear to be tests that check that it is properly saved and restored, etc. So whatever is happening is more subtle than mere omission of this detail in the kernel. See e.g.:
    https://github.com/apple/darwin-xnu/search?q=opmask

Loading

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 2, 2021

@vsivsi thanks for the repro! So it seems it is not related preemption but signals in general. The observation about K0-K7 also suggests VZEROUPPER is not the cause.

Would it be possible to write a similar test in C and assembly (GNU syntax) and see if it is failing?

Maybe we need a different CPU feature probe to tell the kernel we're using K0-K7? (I haven't looked at the code carefully. Just a guess.)

Loading

@cherrymui cherrymui changed the title runtime: async preemption clobbers AVX512 register state runtime: AVX512 register state clobbered by signal on macOS Nov 2, 2021
@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 2, 2021

@cherrymui The mechanism for using AVX-512 in Darwin doesn't involve telling the kernel which features to enable. The kernel simply advertises the available AVX-512 features in the process commpage data area. Once you've verified that a given required set of AVX-512 features is available, you simply use them. Darwin catches the UD interrupt for any AVX-512 instruction on first use, and promotes that thread to use the full AVX512 thread state, enabling every available feature.

Specifically, the three XCR0[5:7] bits (from XGETBV) indicate whether XSAVE will preserve:

  • Registers K0-K7 ("opmask state")
  • The upper 256 bits of ZMM0-15 ("ZMM_Hi256 state")
  • The full width of ZMM16-31 ("Hi16_ZMM state")

Upon thread promotion following the UD interrupt from a supported AVX512 instruction, Darwin sets XCR0[5:7] = 111b, for the thread, enabling the full "AVX512 state" in XSAVE.

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 2, 2021

The kernel definitions of these XCR0 bits are here:
https://github.com/apple/darwin-xnu/blob/8f02f2a044b9bb1ad951987ef5bab20ec9486310/osfmk/i386/proc_reg.h#L168-L188

The full AVX512 enabled mask is composed here:
https://github.com/apple/darwin-xnu/blob/8f02f2a044b9bb1ad951987ef5bab20ec9486310/osfmk/i386/fpu.h#L71

And this is the function that runs when a UD for an AVX512 instruction is trapped, triggering the XSAVE state promotion:
https://github.com/apple/darwin-xnu/blob/8f02f2a044b9bb1ad951987ef5bab20ec9486310/osfmk/i386/fpu.c#L1472

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 2, 2021

@cherrymui Here is an attempted reproduction of this issue in GNU C and Asm. In my testing it does not fail.

https://gist.github.com/vsivsi/8511aca1bac528f49fbb45a636afa4b5

% gcc testmask.c testmask.s && ./a.out &
[1] 27524
% for ((x=0;x<1000000;x++)); do kill -s URG 27524; done
Test Passed!
[1]  + done       ./a.out

Loading

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 2, 2021

Thanks for trying that. In your C code, could you install a signal handler for SIGURG (which could simply return immediately)? Thanks.

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 2, 2021

@cherrymui Gist updated with empty SIGURG handler. Now it fails as with go runtime:

% gcc testmask.c testmask.s && ./a.out &                        
[1] 28786
% for ((x=0;x<20;x++)); do kill -s URG 28786; sleep 0.01; done
Failed for iteration: 133 with return value: 90069936
Failed for iteration: 134 with return value: 79705026
Failed for iteration: 135 with return value: 79757341
Failed for iteration: 136 with return value: 79875856
Failed for iteration: 137 with return value: 82261267
Failed for iteration: 138 with return value: 83127231
Failed for iteration: 139 with return value: 83641375
Failed for iteration: 140 with return value: 83352568
Failed for iteration: 141 with return value: 82555528
Failed for iteration: 142 with return value: 79494222
Failed for iteration: 143 with return value: 78981164
Failed for iteration: 144 with return value: 81261765
Failed for iteration: 145 with return value: 80885813
Failed for iteration: 146 with return value: 79706999
Failed for iteration: 147 with return value: 83441689
Failed for iteration: 148 with return value: 80171870
Failed for iteration: 149 with return value: 79769830
Failed for iteration: 150 with return value: 81361308
Failed for iteration: 151 with return value: 81237987
Failed for iteration: 152 with return value: 79568372
Test failed 20 times!
[1]  + done       ./a.out

Loading

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 2, 2021

This reminds me of #37174 (comment)
Sounds like Apple needs some more regression tests for their sigaction code path.

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 2, 2021

I'm sure you all are happy that this doesn't appear to be a problem in golang, but it still seems like a problem for golang, no?

Loading

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 2, 2021

Thanks @vsivsi ! That indicates the kernel's signal handling code doesn't seem to do the right thing.

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 2, 2021

So what's the next step here? Throwing a process into an indeterminate state via external signal feels like a CVE-level issue.

Loading

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 2, 2021

If there is a simple workaround that we could do it in the runtime, we'll do it, like the VZEROUPPER for #37174 . I don't know how to work around this one, though. One probably has to either not use K0-K7 registers or block all signals.

I guess the next step would be reporting to Apple and have them fix the kernel.

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 2, 2021

While they're at it, they should fix this one too: #42649

Also, not using the K0-K7 opmasks == not using AVX-512 for all practical purposes.

Loading

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 2, 2021

We need to report a bug to Apple. I can do that - I've done it before, and supposedly a report from Google has a bit more visibility than one from random internet user.
Any idea what range of OS versions this causes a problem on? I don't see a OS version in the OP.

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 2, 2021

@randall77 I've seen it on Catalina (10.15.x) and Big Sur (11.x), but I bet it goes all the way back to the first AVX-512 support, which I believe was High Sierra (10.13.x) when the first iMac Pro released with a Xeon supporting AVX-512.

Loading

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 2, 2021

As it is not safe to use, maybe we could consider hardwire AVX512 to false on Darwin until the kernel fix?

Loading

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 2, 2021

We could hardwire internal/cpu (and x/sys/cpu) flags to false for avx512. But if the asm uses cpuid itself to figure out what is available, I don't think we'd be able to prevent avx512 use in that case.

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 2, 2021

The only influence golang has over this is the value set for x/sys/cpu.X86.HasAVX512 which anyone is free to ignore (at their potential peril). But temporarily disabling that won't solve the problem of existing go binaries, or running on unpatched Darwin after Apple releases a fix. The robust solution would be to actually test for the live vuln when setting x/sys/cpu.X86.HasAVX512 at runtime.

Loading

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 3, 2021

Here's a simplified reproduction, just run it and see if it prints anything:

testmask.c

#include <stdio.h>
#include <inttypes.h>
#include <signal.h>
#include <unistd.h>
#include <pthread.h>

volatile int8_t done = 0;

// in testmask.s
uint32_t masktest(uint32_t x);

void sig_handler(int signum) {
  // Tell the assembly loop to return.
  done = 1;
}

void *tester(void* unused) {
  uint32_t x = 0xabcd1234;
  uint32_t y = masktest(x);
  if (x != y) {
    printf("bad: %x -> %x\n", x, y);
  }
  return NULL;
}

int main(int argc, char *argv[]) {
  // Set up simple signal handler.
  signal(SIGURG, sig_handler);

  // Start worker.
  pthread_t worker;
  pthread_create(&worker, NULL, tester, NULL);

  // Wait until the worker is in its assembly loop.
  usleep(1000);

  // Send a signal to the worker.
  pthread_kill(worker, SIGURG);

  // Wait until worker is done.
  pthread_join(worker, NULL);
}

testmask.s

	.globl _done
        .globl _masktest
_masktest:
	// Put a value in the k1 avx512 register.
	kmovd	%edi, %k1
	
	// Wait until we're told to return.
loop:	
	cmpb	$0, _done(%rip)
	je	loop

	// Return the value in the k1 register.
	kmovd	%k1, %eax
	ret

Build with gcc testmask.c testmask.s and run with ./a.out.

Loading

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 3, 2021

Mac OSX bug report here: https://feedbackassistant.apple.com/feedback/9736652
(If you can see it, not sure how they deal with permissions.)

Loading

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 3, 2021

There doesn't seem to be an easy way to let anyone else see my bug report. So here's a screenshot of it:
feedbackassistant apple com_feedback_9736652
feedbackassistant apple com_feedback_9736652 (1)

Loading

@gopherbot
Copy link

@gopherbot gopherbot commented Nov 3, 2021

Change https://golang.org/cl/361255 mentions this issue: cpu: pretend AVX-512 is disabled on Darwin

Loading

gopherbot pushed a commit to golang/sys that referenced this issue Nov 3, 2021
Darwin doesn't save/restore the AVX-512 mask registers when
processing a signal. Go generates lots of signals, which cause
the mask registers to randomly get clobbered. Better to pretend
that AVX-512 isn't supported.

Update golang/go#49233

Change-Id: I0dda77a6f205e10eefccd82e6f4f0f184dae42d9
Reviewed-on: https://go-review.googlesource.com/c/sys/+/361255
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Martin Möhrmann <martin@golang.org>
@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 4, 2021

Loading

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 4, 2021

As far as I can tell, the opmask register state is saved on entry to the signal handler, as the value from the opmask register appears in a dump of the signal context from within the signal handler.
The problem must be on the return side.

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 8, 2021

Seemed like a long-shot, but I just re-tested after updating MacOS with this morning's security updates.
This issue persists in the latest builds:

  System Version:	macOS 11.6.1 (20G224)
  Kernel Version:	Darwin 20.6.0
System Version:	macOS 10.15.7 (19H1519)
Kernel Version:	Darwin 19.6.0

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 9, 2021

@randall77 @cherrymui

I'm nearly certain I've found the kernel bug responsible for this issue by inspection.

It looks like it was introduced in Catalina 10.15.6, and has always existed in Big Sur 11.x. I'm presently installing Monterey on one of my MacBook Pros to see what the status there is, since Apple hasn't yet released the Darwin source for MacOS 12.x. Edit to confirm: The issue also exists in Monterey 12.0.1 / Darwin 21.1.0 (21A559).

It seems not to exist in Mojave 10.14.x or before, although that would only impact the iMac Pro, since that was the only pre-10.15 machine with AVX-512.

The issue is a set of changes introduced in 10.15.6 that are aimed at improving performance when AVX-512 is on, but the thread isn't using the ZMM extended state:

  • The upper 256 bits of ZMM0-15 ("ZMM_Hi256 state")
  • The full width of ZMM16-31 ("Hi16_ZMM state")

This is analogous to your VZEROUPPER optimization for eliminating phantom SSE4 dependencies. But, when the code detects that ZMM_Hi256 state and Hi16_ZMM state are all zeros, it disables XSAVE restore of the "opmask state" as well. This is not a safe assumption, as it is perfectly valid to have non-zero values in K0-K7 while having all zeroed vector registers (as our various reproductions likely illustrate). With AVX-512VL, it is also valid to use EVEX encoded AVX-512 instructions, with opmasks, while only touching the AVX YMM vector state.

My hypothesis above is confirmed by adding a single instruction to the reproduction assembly code sent to Apple:

testmask.s

        .globl _done
        .globl _masktest
_masktest:

        // Storing a non-zero value in ZMM31 suppresses the kernel bug.
        vpbroadcastd  %edi, %zmm31

        // Put a value in the k1 avx512 register.
        kmovd   %edi, %k1

        // Wait until we're told to return.
loop:
        cmpb    $0, _done(%rip)
        je      loop

        // Return the value in the k1 register.
        kmovd   %k1, %eax
        ret

So, here's the offending code in 10.15.6:

if (fpu_allzeroes((uint64_t *)(void *)iavx->x_Hi16_ZMM, 16 * sizeof(_STRUCT_ZMM_REG)) == TRUE &&
    fpu_allzeroes((uint64_t *)(void *)iavx->x_ZMM_Hi256, 16 * sizeof(_STRUCT_YMM_REG)) == TRUE) {
	iavx->_xh.xstate_bv &= ~XFEM_ZMM;
}

From: https://github.com/apple/darwin-xnu/blob/xnu-6153.141.1/osfmk/i386/fpu.c#L954

Where:

#define XFEM_ZMM        (XFEM_ZMM_HI256 | XFEM_HI16_ZMM | XFEM_OPMASK)

From: https://github.com/apple/darwin-xnu/blob/xnu-6153.141.1/osfmk/i386/proc_reg.h#L187

This change was introduced in Darwin xnu-6153.141.1, which corresponds with MacOS10.15.6.

Here's the diff:
apple/darwin-xnu@2a1b271#diff-eebcfce045ea71301ebd7644a71e52b7836189410512149ef8e8e6460a741446R952-R955

So, the fix is to probably replace a single line of code in two locations (for the 32 and 64 bit cases):

	// iavx->_xh.xstate_bv &= ~XFEM_ZMM;
	iavx->_xh.xstate_bv &= ~(XFEM_ZMM_HI256 | XFEM_HI16_ZMM);

Here: https://github.com/apple/darwin-xnu/blob/xnu-6153.141.1/osfmk/i386/fpu.c#L928
and
Here: https://github.com/apple/darwin-xnu/blob/xnu-6153.141.1/osfmk/i386/fpu.c#L954

Now we just need to get Apple to do it. In the meantime there's a weak-ish workaround... Ensure that there's always non-zero values in some part of the AVX-512 extended ZMM state.

Loading

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 9, 2021

Weird. When debugging I thought a use of a zmm register might help, so I added:

	vpxor	%zmm1, %zmm1, %zmm1

and it didn't help. Apparently I needed to set zmm1 to a nonzero value to really "use" that register :(

I'll update my Apple feedback submission.

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 9, 2021

@randall77 Yeah, I went back and checked the original rarely failing unit test case that kicked this all off... It was Avo generated code that only used the YMM registers with opmasks in a 4x64 bit special case. I didn't pick up on the significance of that while performing the debugging that ultimately led to the repro I brought to this issue.

In thinking about the kernel bug presented above overnight, it clarified three things:

  • The scope of the bug in time is now clear. Catalina 10.15.6 was released in mid-July 2020. So this has been in the wild for only around 16 months.
  • The scope of the bug in terms of the sufficient processor state is considerably narrowed: it only occurs when all bits in both the ZMM_Hi256 and Hi16_ZMM state are zero. There are still real valid cases when that can be true, but a lot fewer than were originally assumed.
  • The scope of the bug in terms of how it is triggered has very likely expanded. All of our reproductions depend on BSD signal handling to produce an easily repeatable trigger. But the implicated kernel code is not specific to signal handler returns. It seems likely that other kernel mechanisms that require restoring thread state could also trigger this. Perhaps even ordinary multitasking preemption from the Darwin thread scheduler.

On reflection, I'd also like to add that the "one line fix" I presented above may not simultaneously resolve this issue while still satisfying the aims of the Darwin kernel engineers in making the revisions that appeared in Catalina 10.15.6. Another possible fix would be to leave the masking code the same, but add an additional check to ensure that the opmask state also has all zeroed bits. Or perhaps the opmask save/restore should be broken out as a separate case (combining the two approaches).

In any case, I hope Apple carefully audits all uses of XFEM_ZMM (perhaps renaming it in the process to make clear it encompasses the opmask bit), and adds regression testing to ensure this type of confusion doesn't reoccur.

Loading

@vsivsi
Copy link
Author

@vsivsi vsivsi commented Nov 10, 2021

Just a quick update on point 3) in the post above. I've run the reproduction C/Asm code here:
https://gist.github.com/vsivsi/8511aca1bac528f49fbb45a636afa4b5

As:

for ((x=0;x<250;x++)); do ./a.out & done

When run on a Mac Pro with 48 hyperthreads, each of the 250 processes created above is preemptively multitasked onto ~20% of a hyperthread with a large number of kernel scheduler preemptions per second.

In over an hour of CPU time, I've seen zero opmask corruptions. So I think it's fair to conclude that Darwin's preemptive multitasking does not use the faulty thread state resetting code spotted above.

😅

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants