New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: keeping a subset of a struct value's fields is slow #24386

Open
mvdan opened this Issue Mar 14, 2018 · 13 comments

Comments

Projects
None yet
6 participants
@mvdan
Member

mvdan commented Mar 14, 2018

What version of Go are you using (go version)?

go version devel +e601c07908 Tue Mar 13 22:02:46 2018 +0000 linux/amd64

What did you do?

https://play.golang.org/p/H9LaiI3BP-U

See the comments as to why I prefer writing the slower version, as it helps maintain the software in the long run.

What did you expect to see?

The two to perform the same, or at least comparably.

What did you see instead?

The second method being easily 10x as slow.

BenchmarkResetUnwanted-4     2000000000      0.72 ns/op     0 B/op     0 allocs/op
BenchmarkKeepWanted-4        200000000       9.72 ns/op     0 B/op     0 allocs/op

I can imagine why this is; naively, the second function does more work. It has to write the entire struct - 7 ints - instead of just three ints, like the first one. However, I'd hope that the compiler can learn to recognise assignments like these to avoid the unnecessary extra work.

@mvdan

This comment has been minimized.

Member

mvdan commented Mar 14, 2018

This also seems somewhat common in the standard library:

$ gogrep '*$x = $T{$*_, $k: $x.$v, $*_}' std
compress/flate/dict_decoder.go:40:2: *dd = dictDecoder{hist: dd.hist}
compress/flate/inflate.go:766:2: *f = decompressor{r: makeReader(r), bits: f.bits, codebits: f.codebits, dict: f.dict, step: (*decompressor).nextBlock}
compress/gzip/gunzip.go:104:2: *z = Reader{decompressor: z.decompressor, multistream: true}
compress/zlib/reader.go:130:2: *z = reader{decompressor: z.decompressor}

There may be more occurrences where the code was replaced with the faster version.

@mvdan

This comment has been minimized.

Member

mvdan commented Mar 14, 2018

Also, I realise that this example is perhaps a bit extreme, since the subset of the fields I want to keep is contiguous. However, I imagine that this would still bring some gains in real code.

@josharian

This comment has been minimized.

Contributor

josharian commented Mar 14, 2018

Nice observation. Implements of this optimization need to take care with panics, though—that both the partially-written memory state is correct and that the panic occurs on the correct line (with a multi-line struct literal). Not a show-stopper, just requires some care.

@TocarIP

This comment has been minimized.

Contributor

TocarIP commented Mar 14, 2018

Looking at asm fast case uses 3 instructions to store zeros, while slow case does a lot of stuff:
1)Zeros 56 bytes of stack via 4 instructions
2)Does 4 8-byte copies via 8=2*4 instructions
3) Copies struct from stack to global sink

I think this is a known issue: structs with 4+ fields are treated differently , if I reduce number of fields to 2 (1 wanted,1 unwanted) both cases generate the same code. I think @randall77 knows exact issue number for structs with multiple fields.

@randall77

This comment has been minimized.

Contributor

randall77 commented Mar 14, 2018

I'm not sure there's an issue for improving large struct performance. I've been thinking about it, though.

@josharian

This comment has been minimized.

Contributor

josharian commented Mar 14, 2018

@randall77 probably worth filing one. I've been wondering whether a large struct / large array issue might also help with #23929 in some cases.

@mvdan

This comment has been minimized.

Member

mvdan commented Mar 14, 2018

So the performance difference comes from the struct size, not from the different ways to clear some of the fields? Is there an optimization pass or rule that does this, i.e. avoiding the extra writes/work? Even for small structs, I would assume they would generate different code.

@andybons andybons added this to the Unplanned milestone Mar 15, 2018

@gopherbot

This comment has been minimized.

gopherbot commented Mar 15, 2018

Change https://golang.org/cl/100837 mentions this issue: cmd/compile: optimize struct partial self-assign

@TocarIP

This comment has been minimized.

Contributor

TocarIP commented Mar 15, 2018

It isn't exactly about size, but about number of fields. For <4 fields
*t = Foo{a:t.a} is represented as filed by field assignment.
So something like:

tmp = t.a
t.a = tmp
t.x = 0

Then t.a self assignment gets optimized away and we are left with t.x = 0 for both cases.

@mvdan

This comment has been minimized.

Member

mvdan commented Mar 15, 2018

Thanks for the clarification!

@randall77

This comment has been minimized.

Contributor

randall77 commented Mar 15, 2018

Filed #24416 for large struct optimization.

@randall77

This comment has been minimized.

Contributor

randall77 commented Apr 16, 2018

The underlying concern, that code for zeroing a new field might be missed, can be addressed with a test.

func TestResetUnwanted(t *testing.T) {
	// Note: unkeyed constructor. It will not compile if new fields are added.
	x := T{1, 1, 1, 1, 1, 1, 1}
	// Reset unexported fields.
	ResetUnwanted(&x)
	// Reset exported fields with reflect (or any other method to enumerate the "wanted" fields).
	v := reflect.ValueOf(&x).Elem()
	for i := 0; i < v.Type().NumField(); i++ {
		if v.Type().Field(i).PkgPath == "" { // exported field
			v.Field(i).Set(reflect.Zero(v.Field(i).Type()))
		}
	}

	// Everything should now be zero.
	if x != (T{}) {
		t.Errorf("ResetUnwanted missed a field %v\n", x)
	}
}
@gopherbot

This comment has been minimized.

gopherbot commented Apr 29, 2018

Change https://golang.org/cl/106495 mentions this issue: cmd/compile: add some generic composite type optimizations

gopherbot pushed a commit that referenced this issue May 8, 2018

cmd/compile: add some generic composite type optimizations
Propagate values through some wide Zero/Move operations. Among
other things this allows us to optimize some kinds of array
initialization. For example, the following code no longer
requires a temporary be allocated on the stack. Instead it
writes the values directly into the return value.

func f(i uint32) [4]uint32 {
    return [4]uint32{i, i+1, i+2, i+3}
}

The return value is unnecessarily cleared but removing that is
probably a task for dead store analysis (I think it needs to
be able to match multiple Store ops to wide Zero ops).

In order to reliably remove stack variables that are rendered
unnecessary by these new rules I've added a new generic version
of the unread autos elimination pass.

These rules are triggered more than 5000 times when building and
testing the standard library.

Updates #15925 (fixes for arrays of up to 4 elements).
Updates #24386 (fixes for up to 4 kept elements).
Updates #24416.

compilebench results:

name       old time/op       new time/op       delta
Template         353ms ± 5%        359ms ± 3%    ~     (p=0.143 n=10+10)
Unicode          219ms ± 1%        217ms ± 4%    ~     (p=0.740 n=7+10)
GoTypes          1.26s ± 1%        1.26s ± 2%    ~     (p=0.549 n=9+10)
Compiler         6.00s ± 1%        6.08s ± 1%  +1.42%  (p=0.000 n=9+8)
SSA              15.3s ± 2%        15.6s ± 1%  +2.43%  (p=0.000 n=10+10)
Flate            237ms ± 2%        240ms ± 2%  +1.31%  (p=0.015 n=10+10)
GoParser         285ms ± 1%        285ms ± 1%    ~     (p=0.878 n=8+8)
Reflect          797ms ± 3%        807ms ± 2%    ~     (p=0.065 n=9+10)
Tar              334ms ± 0%        335ms ± 4%    ~     (p=0.460 n=8+10)
XML              419ms ± 0%        423ms ± 1%  +0.91%  (p=0.001 n=7+9)
StdCmd           46.0s ± 0%        46.4s ± 0%  +0.85%  (p=0.000 n=9+9)

name       old user-time/op  new user-time/op  delta
Template         337ms ± 3%        346ms ± 5%    ~     (p=0.053 n=9+10)
Unicode          205ms ±10%        205ms ± 8%    ~     (p=1.000 n=10+10)
GoTypes          1.22s ± 2%        1.21s ± 3%    ~     (p=0.436 n=10+10)
Compiler         5.85s ± 1%        5.93s ± 0%  +1.46%  (p=0.000 n=10+8)
SSA              14.9s ± 1%        15.3s ± 1%  +2.62%  (p=0.000 n=10+10)
Flate            229ms ± 4%        228ms ± 6%    ~     (p=0.796 n=10+10)
GoParser         271ms ± 3%        275ms ± 4%    ~     (p=0.165 n=10+10)
Reflect          779ms ± 5%        775ms ± 2%    ~     (p=0.971 n=10+10)
Tar              317ms ± 4%        319ms ± 5%    ~     (p=0.853 n=10+10)
XML              404ms ± 4%        409ms ± 5%    ~     (p=0.436 n=10+10)

name       old alloc/op      new alloc/op      delta
Template        34.9MB ± 0%       35.0MB ± 0%  +0.26%  (p=0.000 n=10+10)
Unicode         29.3MB ± 0%       29.3MB ± 0%  +0.02%  (p=0.000 n=10+10)
GoTypes          115MB ± 0%        115MB ± 0%  +0.30%  (p=0.000 n=10+10)
Compiler         519MB ± 0%        521MB ± 0%  +0.30%  (p=0.000 n=10+10)
SSA             1.55GB ± 0%       1.57GB ± 0%  +1.34%  (p=0.000 n=10+9)
Flate           24.1MB ± 0%       24.2MB ± 0%  +0.10%  (p=0.000 n=10+10)
GoParser        28.1MB ± 0%       28.1MB ± 0%  +0.07%  (p=0.000 n=10+10)
Reflect         78.7MB ± 0%       78.7MB ± 0%  +0.03%  (p=0.000 n=8+10)
Tar             34.4MB ± 0%       34.5MB ± 0%  +0.12%  (p=0.000 n=10+10)
XML             43.2MB ± 0%       43.2MB ± 0%  +0.13%  (p=0.000 n=10+10)

name       old allocs/op     new allocs/op     delta
Template          330k ± 0%         330k ± 0%  -0.01%  (p=0.017 n=10+10)
Unicode           337k ± 0%         337k ± 0%  +0.01%  (p=0.000 n=9+10)
GoTypes          1.15M ± 0%        1.15M ± 0%  +0.03%  (p=0.000 n=10+10)
Compiler         4.77M ± 0%        4.77M ± 0%  +0.03%  (p=0.000 n=9+10)
SSA              12.5M ± 0%        12.6M ± 0%  +1.16%  (p=0.000 n=10+10)
Flate             221k ± 0%         221k ± 0%  +0.05%  (p=0.000 n=9+10)
GoParser          275k ± 0%         275k ± 0%  +0.01%  (p=0.014 n=10+9)
Reflect           944k ± 0%         944k ± 0%  -0.02%  (p=0.000 n=10+10)
Tar               324k ± 0%         323k ± 0%  -0.12%  (p=0.000 n=10+10)
XML               384k ± 0%         384k ± 0%  -0.01%  (p=0.001 n=10+10)

name       old object-bytes  new object-bytes  delta
Template         476kB ± 0%        476kB ± 0%  -0.04%  (p=0.000 n=10+10)
Unicode          218kB ± 0%        218kB ± 0%    ~     (all equal)
GoTypes         1.58MB ± 0%       1.58MB ± 0%  -0.04%  (p=0.000 n=10+10)
Compiler        6.25MB ± 0%       6.24MB ± 0%  -0.09%  (p=0.000 n=10+10)
SSA             15.9MB ± 0%       16.1MB ± 0%  +1.22%  (p=0.000 n=10+10)
Flate            304kB ± 0%        304kB ± 0%  -0.13%  (p=0.000 n=10+10)
GoParser         370kB ± 0%        370kB ± 0%  -0.00%  (p=0.000 n=10+10)
Reflect         1.27MB ± 0%       1.27MB ± 0%  -0.12%  (p=0.000 n=10+10)
Tar              421kB ± 0%        419kB ± 0%  -0.64%  (p=0.000 n=10+10)
XML              518kB ± 0%        517kB ± 0%  -0.12%  (p=0.000 n=10+10)

name       old export-bytes  new export-bytes  delta
Template        16.7kB ± 0%       16.7kB ± 0%    ~     (all equal)
Unicode         6.52kB ± 0%       6.52kB ± 0%    ~     (all equal)
GoTypes         29.2kB ± 0%       29.2kB ± 0%    ~     (all equal)
Compiler        88.0kB ± 0%       88.0kB ± 0%    ~     (all equal)
SSA              109kB ± 0%        109kB ± 0%    ~     (all equal)
Flate           4.49kB ± 0%       4.49kB ± 0%    ~     (all equal)
GoParser        8.10kB ± 0%       8.10kB ± 0%    ~     (all equal)
Reflect         7.71kB ± 0%       7.71kB ± 0%    ~     (all equal)
Tar             9.15kB ± 0%       9.15kB ± 0%    ~     (all equal)
XML             12.3kB ± 0%       12.3kB ± 0%    ~     (all equal)

name       old text-bytes    new text-bytes    delta
HelloSize        676kB ± 0%        672kB ± 0%  -0.59%  (p=0.000 n=10+10)
CmdGoSize       7.26MB ± 0%       7.24MB ± 0%  -0.18%  (p=0.000 n=10+10)

name       old data-bytes    new data-bytes    delta
HelloSize       10.2kB ± 0%       10.2kB ± 0%    ~     (all equal)
CmdGoSize        248kB ± 0%        248kB ± 0%    ~     (all equal)

name       old bss-bytes     new bss-bytes     delta
HelloSize        125kB ± 0%        125kB ± 0%    ~     (all equal)
CmdGoSize        145kB ± 0%        145kB ± 0%    ~     (all equal)

name       old exe-bytes     new exe-bytes     delta
HelloSize       1.46MB ± 0%       1.45MB ± 0%  -0.31%  (p=0.000 n=10+10)
CmdGoSize       14.7MB ± 0%       14.7MB ± 0%  -0.17%  (p=0.000 n=10+10)

Change-Id: Ic72b0c189dd542f391e1c9ab88a76e9148dc4285
Reviewed-on: https://go-review.googlesource.com/106495
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment