runtime: Darwin slow when using signals + int->float instructions #37174

randall77 · 2020-02-11T21:12:47Z

Split off from #37121

package bench1

import (
	"math"
	"testing"
)

const N = 64

func BenchmarkFast(b *testing.B) {
	var x, y, z [N]float32

	for i := 0; i < b.N; i++ {
		mulFast(&x, &y, &z)
	}
}

func mulFast(x, y, z *[N]float32) {
	for i := 0; i < N; i++ {
		z[i] = x[i] * y[i]
	}
}

func BenchmarkSlow(b *testing.B) {
	var z [N]float32
	var x, y [N]uint32

	for i := 0; i < b.N; i++ {
		mulSlow(&x, &y, &z)
	}
}

func mulSlow(x, y *[N]uint32, z *[N]float32) {
	for i := 0; i < N; i++ {
		z[i] = math.Float32frombits(x[i]) * math.Float32frombits(y[i])
	}
}

% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x 
goos: darwin
goarch: amd64
BenchmarkFast-16    	10000000	        55.9 ns/op
BenchmarkSlow-16    	10000000	        61.1 ns/op
PASS
% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x -test.cpuprofile=cpu.prof
goos: darwin
goarch: amd64
BenchmarkFast-16    	10000000	        89.7 ns/op
BenchmarkSlow-16    	10000000	       223 ns/op
PASS

For some strange reason, code that includes int->float instructions runs a lot slower when profiling is on.

This bug is reproducible from at least 1.11.

The text was updated successfully, but these errors were encountered:

gopherbot · 2020-02-11T22:25:52Z

Change https://golang.org/cl/219127 mentions this issue: cmd/compile: add rewrite to avoid MOVLi2f

dr2chase · 2020-02-11T22:27:31Z

@jyknight suggests lack of a vzeroupper, perhaps in the Darwin signal handler.

dr2chase · 2020-02-11T22:35:15Z

@cherrymui reports that gratuitously inserting the VZEROUPPER before the conversion instruction also works. Wheeee!

randall77 · 2020-02-11T23:06:46Z

Can we put the VZEROUPPER in our signal handler return path, when the code that was interrupted was Go code? That would be a more efficient way to ensure avx-cleanliness of float registers for Go.

dr2chase · 2020-02-11T23:10:06Z

We have an office full of people playing musical VZEROUPPERs trying to figure out what the problem is. A whole bunch of things don't help, we think the bug is in Darwin....

cherrymui · 2020-02-11T23:11:11Z

@dr2chase tried that, it didn't help. I also tried adding VZEROUPPER to asyncPreempt, it didn't help either. I also tried to change sigtramp to immediately return, it also didn't help...

randall77 · 2020-02-11T23:31:59Z

Here's a C/assembly reproducer:

main.c

#include <stdio.h>
#include <stdint.h>
#include <signal.h>

#define N 64
#define SIG SIGFPE

// Does z = y+float64(x) for 64 entries. This is the main benchmark.
void add(uint64_t *x, double *y, double *z);

uint64_t x[N];
double y[N];
double z[N];

void handler(int sig, siginfo_t *info, void *arg) {
}

int main(int argc, char *argv[]) {
  // Install handler for a signal.
  // sa_sigaction is required; sa_handler doesn't trigger the bug.
  struct sigaction act;
  act.sa_sigaction = handler;
  act.sa_flags |= SA_SIGINFO;
  sigaction(SIG, &act, NULL);
  // Raise the signal, which dirties the upper bits of YMM registers.
  raise(SIG);

  // Run the benchmark.
  for (int i = 0; i < 40000000; i++) {
    add(&x[0], &y[0], &z[0]);
  }
  return 0;
}

add.s:

	.globl	_add
_add:
	// Uncomment the vzeroupper to "fix" the bug.
	//vzeroupper
	movq	$0, %rax
loop:
	cmpq	$64, %rax
	je	done
	movq	(%rdi, %rax, 8), %rcx
	movq	%rcx, %xmm0
	addsd	(%rsi, %rax, 8), %xmm0
	movsd	%xmm0, (%rdx, %rax, 8)
	incq	%rax
	jmp	loop

done:
	ret

cherrymui · 2020-02-11T23:50:33Z

I also tried adding VZEROUPPER to asyncPreempt, it didn't help either.

That was not true. I had nosplit in mulFast/mulSlow left for previous experiments, which caused it not to actually inject the call. Removing the nosplit makes it work, and it actually faster.

Before:

$ go-tmp test -bench=. .
goos: darwin
goarch: amd64
BenchmarkFast-8   	14685382	        90.9 ns/op
BenchmarkSlow-8   	 8348937	       184 ns/op
PASS
ok  	_/tmp/pp	3.121s

After:

$ go-tmp test -bench=. .
goos: darwin
goarch: amd64
BenchmarkFast-8   	24309422	        48.6 ns/op
BenchmarkSlow-8   	25514097	        49.0 ns/op
PASS
ok  	_/tmp/pp	2.553s

gopherbot · 2020-02-11T23:56:52Z

Change https://golang.org/cl/219131 mentions this issue: runtime: zero upper bit of Y registers in asyncPreempt on darwin/amd64

mknyszek · 2020-02-12T00:09:09Z

@randall77 @jyknight and I tried also commenting out the "raise" line and leaving "vzeroupper" uncommented and the performance was somewhere in between "vzeroupper" being used and not (with raise still in there), which doesn't really make much sense to me. Might be a useful data point.

randall77 · 2020-02-12T00:41:29Z

On our 1.11 builders, my C repro is ~14% slower with the VZEROUPPER commented out.
On 1.12, ~13% slower.
On 1.14, ~11% faster.
On 1.15, ~11% faster.

So for some reason this bug doesn't appear on the builders like it does on my desktop. I could imagine that some virtualization layer hides the bug, and the VZEROUPPER instruction has a cost.

randall77 · 2020-02-12T00:42:17Z

I reported a bug to apple here.
We'll see if we get any traction on that.

Apparently, the signal handling code path in darwin kernel leaves the upper bits of Y registers in a dirty state, which causes many SSE operations (128-bit and narrower) become much slower. Clear the upper bits to get to a clean state. We do it at the entry of asyncPreempt, which is immediately following exiting from the kernel's signal handling code, if we actually injected a call. It does not cover other exits where we don't inject a call, e.g. failed preemption, profiling signal, or other async signals. But it does cover an important use case of async signals, preempting a tight numerical loop, which we introduced in this cycle. Running the benchmark in issue #37174: name old time/op new time/op delta Fast-8 90.0ns ± 1% 46.8ns ± 3% -47.97% (p=0.000 n=10+10) Slow-8 188ns ± 5% 49ns ± 1% -73.82% (p=0.000 n=10+9) There is no more slowdown due to preemption signals. For #37174. Change-Id: I8b83d083fade1cabbda09b4bc25ccbadafaf7605 Reviewed-on: https://go-review.googlesource.com/c/go/+/219131 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>

odeke-em · 2020-05-18T02:54:28Z

@randall77, unfortunately your bug report to Apple seems to return a 404.

randall77 · 2020-05-18T03:21:18Z

The link works fine for me. It is behind a login, though. Not sure if that would give you a 404 or not.

odeke-em · 2020-05-18T03:44:51Z

I did login and tried again but in vain

Anyways it is not a problem, was just checking up on this issue as I was re-triaging bugs that are in the Go1.15 milestone. Sorry for the noise :)

ianlancetaylor · 2020-06-16T02:47:23Z

Do we think that there is anything to fix in Go here?

randall77 · 2020-06-16T04:22:02Z

No, I think there's nothing left on our end.

The bug I linked to above now says:

Recent Similar Reports:Less than 10
Resolution:Potential fix identified - For a future OS update

Pretty vague, but at least something is happening.

Maybe when something is released, we can circle back and check the fix works, and then maybe remove our workaround (CL 219131) when the buggy versions of OSX become unsupported.
No need to keep the issue open, though.

randall77 · 2020-08-31T15:53:21Z

Apple claims this was fixed in 10.15.6. Anyone want to verify? I'm still on 10.15.2.
(Just build and run the C code and see what happens.)

cuonglm · 2020-08-31T16:19:31Z

Apple claims this was fixed in 10.15.6. Anyone want to verify? I'm still on 10.15.2.
(Just build and run the C code and see what happens.)

Nothing happens:

$ time ./a.out
=====
JOB ./a.out
99%    cpu
1.97s real
1.96s user
0.00s sys

randall77 · 2020-08-31T16:37:27Z

Try it with and without the VZEROUPPER in add.s commented out. If it's fixed, there shouldn't be any performance difference between the two.

cuonglm · 2020-08-31T16:40:26Z

Try it with and without the VZEROUPPER in add.s commented out. If it's fixed, there shouldn't be any performance difference between the two.

Here's with VZEROUPPER commented out:

=====
JOB ./a.out
99%    cpu
1.88s real
1.88s user
0.00s sys

So it's still slightly faster than with VZEROUPPER commented (I posted above).

randall77 · 2020-08-31T16:50:10Z

Ok, sounds fixed.
So once we stop supporting releases <10.15.6, we can get rid of the VZEROUPPER patch.
We need a way to ask googlebot to reopen issues triggered on a minimum supported OS level.

dmitshur · 2020-08-31T16:55:00Z

We need a way to ask googlebot to reopen issues triggered on a minimum supported OS level.

We don't have the ability to track minimum OS support yet, so the best alternative I can suggest for now is to open a new tracking issue.

randall77 · 2020-11-17T03:50:31Z

@HenkPoley thinks this might have regressed in 10.15.7. See his comment in #41152.

HenkPoley · 2020-11-17T07:18:59Z

Just for the record. I have no system to test this on. I don't run Go. Just gave a headsup that the issue in macOS that Apple forgets to use VZEROUPPER in their code (causing issues here) seems to have re-appeared. Which was the original issue, isn't it?

I don't think Go need to do anything, except shift that 'remove VZEROUPPER patch once Darwin <10.15.6 is not supported #41152' to =<10.15.7

randall77 · 2020-11-17T07:24:04Z

Right, sorry, it's not that there's any fix to do for this bug. We just need to re-evaluate the point at which the currently checked-in fix will be superfluous.

dmitshur · 2020-12-03T22:23:39Z

@randall77 Does this issue still need to be open, or is it enough that #41152 is open?

It's currently milestoned for Go 1.15, so if it needs to stays open, we should move it to a milestone that isn't in the past. If there's something different to do, it might be more clear to open a new issue and reference this one.

randall77 · 2020-12-04T22:12:13Z

Yes, #41152 is all we need now.

randall77 added the OS-Darwin label Feb 11, 2020

randall77 added this to the Go1.14 milestone Feb 11, 2020

randall77 mentioned this issue Feb 11, 2020

runtime: unaligned jumps causing performance regression on Intel #37121

Closed

dmitshur added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 11, 2020

toothrot modified the milestones: Go1.14, Go1.15 Feb 25, 2020

randall77 mentioned this issue Feb 26, 2020

SIGILL: illegal instruction on any go tool under macOS #37459

Closed

dmitshur mentioned this issue Feb 26, 2020

runtime: dependency on AVX instruction set added in Go 1.14 on macOS #37477

Closed

randall77 closed this as completed Jun 16, 2020

davecheney mentioned this issue Aug 31, 2020

Proposal: Go 2: add builtin function printf, sprint(/f/ln) #40856

Closed

cuonglm mentioned this issue Sep 1, 2020

runtime: remove VZEROUPPER patch once Darwin <10.15.6 is not supported #41152

Closed

randall77 reopened this Nov 17, 2020

randall77 closed this as completed Dec 4, 2020

randall77 mentioned this issue Oct 29, 2021

runtime: AVX512 register state clobbered by signal on macOS #49233

Open

golang locked and limited conversation to collaborators Dec 4, 2021

gopherbot added the FrozenDueToAge label Dec 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: Darwin slow when using signals + int->float instructions #37174

runtime: Darwin slow when using signals + int->float instructions #37174

randall77 commented Feb 11, 2020

gopherbot commented Feb 11, 2020

dr2chase commented Feb 11, 2020

dr2chase commented Feb 11, 2020

randall77 commented Feb 11, 2020

dr2chase commented Feb 11, 2020

cherrymui commented Feb 11, 2020

randall77 commented Feb 11, 2020

cherrymui commented Feb 11, 2020

gopherbot commented Feb 11, 2020

mknyszek commented Feb 12, 2020 •

edited

Loading

randall77 commented Feb 12, 2020

randall77 commented Feb 12, 2020

odeke-em commented May 18, 2020

randall77 commented May 18, 2020

odeke-em commented May 18, 2020

ianlancetaylor commented Jun 16, 2020

randall77 commented Jun 16, 2020

randall77 commented Aug 31, 2020

cuonglm commented Aug 31, 2020

randall77 commented Aug 31, 2020 •

edited

Loading

cuonglm commented Aug 31, 2020

randall77 commented Aug 31, 2020

dmitshur commented Aug 31, 2020

randall77 commented Nov 17, 2020

HenkPoley commented Nov 17, 2020 •

edited

Loading

randall77 commented Nov 17, 2020

dmitshur commented Dec 3, 2020 •

edited

Loading

randall77 commented Dec 4, 2020

runtime: Darwin slow when using signals + int->float instructions #37174

runtime: Darwin slow when using signals + int->float instructions #37174

Comments

randall77 commented Feb 11, 2020

gopherbot commented Feb 11, 2020

dr2chase commented Feb 11, 2020

dr2chase commented Feb 11, 2020

randall77 commented Feb 11, 2020

dr2chase commented Feb 11, 2020

cherrymui commented Feb 11, 2020

randall77 commented Feb 11, 2020

cherrymui commented Feb 11, 2020

gopherbot commented Feb 11, 2020

mknyszek commented Feb 12, 2020 • edited Loading

randall77 commented Feb 12, 2020

randall77 commented Feb 12, 2020

odeke-em commented May 18, 2020

randall77 commented May 18, 2020

odeke-em commented May 18, 2020

ianlancetaylor commented Jun 16, 2020

randall77 commented Jun 16, 2020

randall77 commented Aug 31, 2020

cuonglm commented Aug 31, 2020

randall77 commented Aug 31, 2020 • edited Loading

cuonglm commented Aug 31, 2020

randall77 commented Aug 31, 2020

dmitshur commented Aug 31, 2020

randall77 commented Nov 17, 2020

HenkPoley commented Nov 17, 2020 • edited Loading

randall77 commented Nov 17, 2020

dmitshur commented Dec 3, 2020 • edited Loading

randall77 commented Dec 4, 2020

mknyszek commented Feb 12, 2020 •

edited

Loading

randall77 commented Aug 31, 2020 •

edited

Loading

HenkPoley commented Nov 17, 2020 •

edited

Loading

dmitshur commented Dec 3, 2020 •

edited

Loading