Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: Darwin slow when using signals + int->float instructions #37174

Open
randall77 opened this issue Feb 11, 2020 · 12 comments
Open

runtime: Darwin slow when using signals + int->float instructions #37174

randall77 opened this issue Feb 11, 2020 · 12 comments

Comments

@randall77
Copy link
Contributor

@randall77 randall77 commented Feb 11, 2020

Split off from #37121

package bench1

import (
	"math"
	"testing"
)

const N = 64

func BenchmarkFast(b *testing.B) {
	var x, y, z [N]float32

	for i := 0; i < b.N; i++ {
		mulFast(&x, &y, &z)
	}
}

func mulFast(x, y, z *[N]float32) {
	for i := 0; i < N; i++ {
		z[i] = x[i] * y[i]
	}
}

func BenchmarkSlow(b *testing.B) {
	var z [N]float32
	var x, y [N]uint32

	for i := 0; i < b.N; i++ {
		mulSlow(&x, &y, &z)
	}
}

func mulSlow(x, y *[N]uint32, z *[N]float32) {
	for i := 0; i < N; i++ {
		z[i] = math.Float32frombits(x[i]) * math.Float32frombits(y[i])
	}
}
% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x 
goos: darwin
goarch: amd64
BenchmarkFast-16    	10000000	        55.9 ns/op
BenchmarkSlow-16    	10000000	        61.1 ns/op
PASS
% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x -test.cpuprofile=cpu.prof
goos: darwin
goarch: amd64
BenchmarkFast-16    	10000000	        89.7 ns/op
BenchmarkSlow-16    	10000000	       223 ns/op
PASS

For some strange reason, code that includes int->float instructions runs a lot slower when profiling is on.

This bug is reproducible from at least 1.11.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Feb 11, 2020

Change https://golang.org/cl/219127 mentions this issue: cmd/compile: add rewrite to avoid MOVLi2f

@dr2chase

This comment has been minimized.

Copy link
Contributor

@dr2chase dr2chase commented Feb 11, 2020

@jyknight suggests lack of a vzeroupper, perhaps in the Darwin signal handler.

@dr2chase

This comment has been minimized.

Copy link
Contributor

@dr2chase dr2chase commented Feb 11, 2020

@cherrymui reports that gratuitously inserting the VZEROUPPER before the conversion instruction also works. Wheeee!

@randall77

This comment has been minimized.

Copy link
Contributor Author

@randall77 randall77 commented Feb 11, 2020

Can we put the VZEROUPPER in our signal handler return path, when the code that was interrupted was Go code? That would be a more efficient way to ensure avx-cleanliness of float registers for Go.

@dr2chase

This comment has been minimized.

Copy link
Contributor

@dr2chase dr2chase commented Feb 11, 2020

We have an office full of people playing musical VZEROUPPERs trying to figure out what the problem is. A whole bunch of things don't help, we think the bug is in Darwin....

@cherrymui

This comment has been minimized.

Copy link
Contributor

@cherrymui cherrymui commented Feb 11, 2020

@dr2chase tried that, it didn't help. I also tried adding VZEROUPPER to asyncPreempt, it didn't help either. I also tried to change sigtramp to immediately return, it also didn't help...

@randall77

This comment has been minimized.

Copy link
Contributor Author

@randall77 randall77 commented Feb 11, 2020

Here's a C/assembly reproducer:

main.c

#include <stdio.h>
#include <stdint.h>
#include <signal.h>

#define N 64
#define SIG SIGFPE

// Does z = y+float64(x) for 64 entries. This is the main benchmark.
void add(uint64_t *x, double *y, double *z);

uint64_t x[N];
double y[N];
double z[N];

void handler(int sig, siginfo_t *info, void *arg) {
}

int main(int argc, char *argv[]) {
  // Install handler for a signal.
  // sa_sigaction is required; sa_handler doesn't trigger the bug.
  struct sigaction act;
  act.sa_sigaction = handler;
  act.sa_flags |= SA_SIGINFO;
  sigaction(SIG, &act, NULL);
  // Raise the signal, which dirties the upper bits of YMM registers.
  raise(SIG);

  // Run the benchmark.
  for (int i = 0; i < 40000000; i++) {
    add(&x[0], &y[0], &z[0]);
  }
  return 0;
}

add.s:

	.globl	_add
_add:
	// Uncomment the vzeroupper to "fix" the bug.
	//vzeroupper
	movq	$0, %rax
loop:
	cmpq	$64, %rax
	je	done
	movq	(%rdi, %rax, 8), %rcx
	movq	%rcx, %xmm0
	addsd	(%rsi, %rax, 8), %xmm0
	movsd	%xmm0, (%rdx, %rax, 8)
	incq	%rax
	jmp	loop

done:
	ret
@cherrymui

This comment has been minimized.

Copy link
Contributor

@cherrymui cherrymui commented Feb 11, 2020

I also tried adding VZEROUPPER to asyncPreempt, it didn't help either.

That was not true. I had nosplit in mulFast/mulSlow left for previous experiments, which caused it not to actually inject the call. Removing the nosplit makes it work, and it actually faster.

Before:

$ go-tmp test -bench=. .
goos: darwin
goarch: amd64
BenchmarkFast-8   	14685382	        90.9 ns/op
BenchmarkSlow-8   	 8348937	       184 ns/op
PASS
ok  	_/tmp/pp	3.121s

After:

$ go-tmp test -bench=. .
goos: darwin
goarch: amd64
BenchmarkFast-8   	24309422	        48.6 ns/op
BenchmarkSlow-8   	25514097	        49.0 ns/op
PASS
ok  	_/tmp/pp	2.553s
@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Feb 11, 2020

Change https://golang.org/cl/219131 mentions this issue: runtime: zero upper bit of Y registers in asyncPreempt on darwin/amd64

@mknyszek

This comment has been minimized.

Copy link
Contributor

@mknyszek mknyszek commented Feb 12, 2020

@randall77 @jyknight and I tried also commenting out the "raise" line and leaving "vzeroupper" uncommented and the performance was somewhere in between "vzeroupper" being used and not (with raise still in there), which doesn't really make much sense to me. Might be a useful data point.

@randall77

This comment has been minimized.

Copy link
Contributor Author

@randall77 randall77 commented Feb 12, 2020

On our 1.11 builders, my C repro is ~14% slower with the VZEROUPPER commented out.
On 1.12, ~13% slower.
On 1.14, ~11% faster.
On 1.15, ~11% faster.

So for some reason this bug doesn't appear on the builders like it does on my desktop. I could imagine that some virtualization layer hides the bug, and the VZEROUPPER instruction has a cost.

@randall77

This comment has been minimized.

Copy link
Contributor Author

@randall77 randall77 commented Feb 12, 2020

I reported a bug to apple here.
We'll see if we get any traction on that.

gopherbot pushed a commit that referenced this issue Feb 13, 2020
Apparently, the signal handling code path in darwin kernel leaves
the upper bits of Y registers in a dirty state, which causes many
SSE operations (128-bit and narrower) become much slower. Clear
the upper bits to get to a clean state.

We do it at the entry of asyncPreempt, which is immediately
following exiting from the kernel's signal handling code, if we
actually injected a call. It does not cover other exits where we
don't inject a call, e.g. failed preemption, profiling signal, or
other async signals. But it does cover an important use case of
async signals, preempting a tight numerical loop, which we
introduced in this cycle.

Running the benchmark in issue #37174:

name    old time/op  new time/op  delta
Fast-8  90.0ns ± 1%  46.8ns ± 3%  -47.97%  (p=0.000 n=10+10)
Slow-8   188ns ± 5%    49ns ± 1%  -73.82%  (p=0.000 n=10+9)

There is no more slowdown due to preemption signals.

For #37174.

Change-Id: I8b83d083fade1cabbda09b4bc25ccbadafaf7605
Reviewed-on: https://go-review.googlesource.com/c/go/+/219131
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@toothrot toothrot modified the milestones: Go1.14, Go1.15 Feb 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.