Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: Darwin slow when using signals + int->float instructions #37174

Closed
randall77 opened this issue Feb 11, 2020 · 23 comments
Closed

runtime: Darwin slow when using signals + int->float instructions #37174

randall77 opened this issue Feb 11, 2020 · 23 comments

Comments

@randall77
Copy link
Contributor

@randall77 randall77 commented Feb 11, 2020

Split off from #37121

package bench1

import (
	"math"
	"testing"
)

const N = 64

func BenchmarkFast(b *testing.B) {
	var x, y, z [N]float32

	for i := 0; i < b.N; i++ {
		mulFast(&x, &y, &z)
	}
}

func mulFast(x, y, z *[N]float32) {
	for i := 0; i < N; i++ {
		z[i] = x[i] * y[i]
	}
}

func BenchmarkSlow(b *testing.B) {
	var z [N]float32
	var x, y [N]uint32

	for i := 0; i < b.N; i++ {
		mulSlow(&x, &y, &z)
	}
}

func mulSlow(x, y *[N]uint32, z *[N]float32) {
	for i := 0; i < N; i++ {
		z[i] = math.Float32frombits(x[i]) * math.Float32frombits(y[i])
	}
}
% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x 
goos: darwin
goarch: amd64
BenchmarkFast-16    	10000000	        55.9 ns/op
BenchmarkSlow-16    	10000000	        61.1 ns/op
PASS
% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x -test.cpuprofile=cpu.prof
goos: darwin
goarch: amd64
BenchmarkFast-16    	10000000	        89.7 ns/op
BenchmarkSlow-16    	10000000	       223 ns/op
PASS

For some strange reason, code that includes int->float instructions runs a lot slower when profiling is on.

This bug is reproducible from at least 1.11.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 11, 2020

Change https://golang.org/cl/219127 mentions this issue: cmd/compile: add rewrite to avoid MOVLi2f

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Feb 11, 2020

@jyknight suggests lack of a vzeroupper, perhaps in the Darwin signal handler.

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Feb 11, 2020

@cherrymui reports that gratuitously inserting the VZEROUPPER before the conversion instruction also works. Wheeee!

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Feb 11, 2020

Can we put the VZEROUPPER in our signal handler return path, when the code that was interrupted was Go code? That would be a more efficient way to ensure avx-cleanliness of float registers for Go.

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Feb 11, 2020

We have an office full of people playing musical VZEROUPPERs trying to figure out what the problem is. A whole bunch of things don't help, we think the bug is in Darwin....

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Feb 11, 2020

@dr2chase tried that, it didn't help. I also tried adding VZEROUPPER to asyncPreempt, it didn't help either. I also tried to change sigtramp to immediately return, it also didn't help...

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Feb 11, 2020

Here's a C/assembly reproducer:

main.c

#include <stdio.h>
#include <stdint.h>
#include <signal.h>

#define N 64
#define SIG SIGFPE

// Does z = y+float64(x) for 64 entries. This is the main benchmark.
void add(uint64_t *x, double *y, double *z);

uint64_t x[N];
double y[N];
double z[N];

void handler(int sig, siginfo_t *info, void *arg) {
}

int main(int argc, char *argv[]) {
  // Install handler for a signal.
  // sa_sigaction is required; sa_handler doesn't trigger the bug.
  struct sigaction act;
  act.sa_sigaction = handler;
  act.sa_flags |= SA_SIGINFO;
  sigaction(SIG, &act, NULL);
  // Raise the signal, which dirties the upper bits of YMM registers.
  raise(SIG);

  // Run the benchmark.
  for (int i = 0; i < 40000000; i++) {
    add(&x[0], &y[0], &z[0]);
  }
  return 0;
}

add.s:

	.globl	_add
_add:
	// Uncomment the vzeroupper to "fix" the bug.
	//vzeroupper
	movq	$0, %rax
loop:
	cmpq	$64, %rax
	je	done
	movq	(%rdi, %rax, 8), %rcx
	movq	%rcx, %xmm0
	addsd	(%rsi, %rax, 8), %xmm0
	movsd	%xmm0, (%rdx, %rax, 8)
	incq	%rax
	jmp	loop

done:
	ret
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Feb 11, 2020

I also tried adding VZEROUPPER to asyncPreempt, it didn't help either.

That was not true. I had nosplit in mulFast/mulSlow left for previous experiments, which caused it not to actually inject the call. Removing the nosplit makes it work, and it actually faster.

Before:

$ go-tmp test -bench=. .
goos: darwin
goarch: amd64
BenchmarkFast-8   	14685382	        90.9 ns/op
BenchmarkSlow-8   	 8348937	       184 ns/op
PASS
ok  	_/tmp/pp	3.121s

After:

$ go-tmp test -bench=. .
goos: darwin
goarch: amd64
BenchmarkFast-8   	24309422	        48.6 ns/op
BenchmarkSlow-8   	25514097	        49.0 ns/op
PASS
ok  	_/tmp/pp	2.553s
@gopherbot
Copy link

@gopherbot gopherbot commented Feb 11, 2020

Change https://golang.org/cl/219131 mentions this issue: runtime: zero upper bit of Y registers in asyncPreempt on darwin/amd64

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 12, 2020

@randall77 @jyknight and I tried also commenting out the "raise" line and leaving "vzeroupper" uncommented and the performance was somewhere in between "vzeroupper" being used and not (with raise still in there), which doesn't really make much sense to me. Might be a useful data point.

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Feb 12, 2020

On our 1.11 builders, my C repro is ~14% slower with the VZEROUPPER commented out.
On 1.12, ~13% slower.
On 1.14, ~11% faster.
On 1.15, ~11% faster.

So for some reason this bug doesn't appear on the builders like it does on my desktop. I could imagine that some virtualization layer hides the bug, and the VZEROUPPER instruction has a cost.

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Feb 12, 2020

I reported a bug to apple here.
We'll see if we get any traction on that.

gopherbot pushed a commit that referenced this issue Feb 13, 2020
Apparently, the signal handling code path in darwin kernel leaves
the upper bits of Y registers in a dirty state, which causes many
SSE operations (128-bit and narrower) become much slower. Clear
the upper bits to get to a clean state.

We do it at the entry of asyncPreempt, which is immediately
following exiting from the kernel's signal handling code, if we
actually injected a call. It does not cover other exits where we
don't inject a call, e.g. failed preemption, profiling signal, or
other async signals. But it does cover an important use case of
async signals, preempting a tight numerical loop, which we
introduced in this cycle.

Running the benchmark in issue #37174:

name    old time/op  new time/op  delta
Fast-8  90.0ns ± 1%  46.8ns ± 3%  -47.97%  (p=0.000 n=10+10)
Slow-8   188ns ± 5%    49ns ± 1%  -73.82%  (p=0.000 n=10+9)

There is no more slowdown due to preemption signals.

For #37174.

Change-Id: I8b83d083fade1cabbda09b4bc25ccbadafaf7605
Reviewed-on: https://go-review.googlesource.com/c/go/+/219131
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@toothrot toothrot modified the milestones: Go1.14, Go1.15 Feb 25, 2020
@odeke-em
Copy link
Member

@odeke-em odeke-em commented May 18, 2020

@randall77, unfortunately your bug report to Apple seems to return a 404.

@randall77
Copy link
Contributor Author

@randall77 randall77 commented May 18, 2020

The link works fine for me. It is behind a login, though. Not sure if that would give you a 404 or not.

@odeke-em
Copy link
Member

@odeke-em odeke-em commented May 18, 2020

I did login and tried again but in vain
3F389383-7881-4363-BA4B-10EB186F360F

Anyways it is not a problem, was just checking up on this issue as I was re-triaging bugs that are in the Go1.15 milestone. Sorry for the noise :)

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jun 16, 2020

Do we think that there is anything to fix in Go here?

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Jun 16, 2020

No, I think there's nothing left on our end.

The bug I linked to above now says:

Recent Similar Reports:Less than 10
Resolution:Potential fix identified - For a future OS update

Pretty vague, but at least something is happening.

Maybe when something is released, we can circle back and check the fix works, and then maybe remove our workaround (CL 219131) when the buggy versions of OSX become unsupported.
No need to keep the issue open, though.

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Aug 31, 2020

Apple claims this was fixed in 10.15.6. Anyone want to verify? I'm still on 10.15.2.
(Just build and run the C code and see what happens.)

@cuonglm
Copy link
Contributor

@cuonglm cuonglm commented Aug 31, 2020

Apple claims this was fixed in 10.15.6. Anyone want to verify? I'm still on 10.15.2.
(Just build and run the C code and see what happens.)

Nothing happens:

$ time ./a.out
=====
JOB ./a.out
99%    cpu
1.97s real
1.96s user
0.00s sys
@randall77
Copy link
Contributor Author

@randall77 randall77 commented Aug 31, 2020

Try it with and without the VZEROUPPER in add.s commented out. If it's fixed, there shouldn't be any performance difference between the two.

@cuonglm
Copy link
Contributor

@cuonglm cuonglm commented Aug 31, 2020

Try it with and without the VZEROUPPER in add.s commented out. If it's fixed, there shouldn't be any performance difference between the two.

Here's with VZEROUPPER commented out:

=====
JOB ./a.out
99%    cpu
1.88s real
1.88s user
0.00s sys

So it's still slightly faster than with VZEROUPPER commented (I posted above).

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Aug 31, 2020

Ok, sounds fixed.
So once we stop supporting releases <10.15.6, we can get rid of the VZEROUPPER patch.
We need a way to ask googlebot to reopen issues triggered on a minimum supported OS level.

@dmitshur
Copy link
Member

@dmitshur dmitshur commented Aug 31, 2020

We need a way to ask googlebot to reopen issues triggered on a minimum supported OS level.

We don't have the ability to track minimum OS support yet, so the best alternative I can suggest for now is to open a new tracking issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
10 participants
You can’t perform that action at this time.