
cmd/internal/obj/x86: pad jumps to avoid Intel erratum #35881

rsc opened this issue Nov 27, 2019 · 12 comments
@rsc rsc commented Nov 27, 2019

Intel erratum SKX102 “Processor May Behave Unpredictably Under Complex Sequence of Conditions Which Involve Branches That Cross 64-Byte Boundaries” applies to:

  • Intel® Celeron® Processor 4000 Series
  • Intel® Celeron® Processor G Series
  • 10th Generation Intel® Core™ i5 Processors
  • 10th Generation Intel® Core™ i7 Processors
  • 6th Generation Intel® Core™ i3 Processors
  • 6th Generation Intel® Core™ i5 Processors
  • 6th Generation Intel® Core™ i7 Processors
  • 6th Generation Intel® Core™ m Processors
  • 7th Generation Intel® Core™ i3 Processors
  • 7th Generation Intel® Core™ i5 Processors
  • 7th Generation Intel® Core™ i7 Processors
  • 7th Generation Intel® Core™ m Processors
  • 8th Generation Intel® Core™ i3 Processors
  • 8th Generation Intel® Core™ i5 Processors
  • 8th Generation Intel® Core™ i7 Processors
  • 8th Generation Intel® Core™ m Processors
  • 9th Generation Intel® Core™ i9 Processors
  • Intel® Core™ X-series Processors
  • Intel® Pentium® Gold Processor Series
  • Intel® Pentium® Processor G Series
  • Intel® Xeon® Processor E3 v5 Family
  • Intel® Xeon® Processor E3 v6 Family
  • 2nd Generation Intel® Xeon® Scalable Processors
  • Intel® Xeon® E Processor
  • Intel® Xeon® Scalable Processors
  • Intel® Xeon® W Processor

There is a microcode fix that can be applied by the BIOS to avoid the incorrect execution. It stops any jump (JMP, Jcc, CALL, RET; direct or indirect) from being cached in the decoded icache when the instruction ends at or crosses a 32-byte boundary. Intel says:

Intel has observed performance effects associated with the workaround [microcode fix] ranging from 0-4% on many industry-standard benchmarks. In subcomponents of these benchmarks, Intel has observed outliers higher than the 0-4% range. Other workloads not observed by Intel may behave differently.

The suggested workaround for the workaround is to insert padding so that fused branch sequences never end at or cross a 64-byte boundary. This means the whole CMP+Jcc, not just Jcc.
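To make the condition concrete, here is a minimal sketch in Go (mine, not code from the CL) of the boundary test the assembler has to apply, treating a fused CMP+Jcc pair as a single instruction:

package main

import "fmt"

// needsPadding reports whether an instruction (or fused CMP+Jcc pair)
// occupying bytes [start, start+size) ends at or crosses a 32-byte
// boundary, the condition that the microcode fix penalizes.
func needsPadding(start, size int) bool {
	end := start + size // offset of the first byte after the instruction
	return start/32 != (end-1)/32 || end%32 == 0
}

func main() {
	fmt.Println(needsPadding(24, 5)) // false: bytes 24..28 stay inside one block
	fmt.Println(needsPadding(27, 5)) // true: last byte is 31, ends exactly at the boundary
	fmt.Println(needsPadding(30, 5)) // true: bytes 30..34 cross the boundary at 32
}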

CL 206837 adds a new environment variable to set the padding policy. The original CL used $GO_X86_PADJUMP but the discussion has moved on to using $GOAMD64, which would avoid breaking the build cache.

There are really two questions here:

  • What is the right amount of padding to insert by default?
  • Given that default, what additional control do developers need over the padding?

In general, we try to do the right thing for developers so that they don't have to keep track of every last CPU erratum. That seems like it would suggest we should do the padding automatically. Otherwise Go programs on this very large list of processors have the possibility of behaving “unpredictably.”

If the overheads involved are small enough and we are 100% confident in the padding code, we could stop there and just leave it on unconditionally. It seems like that's what we should do rather than open the door to arbitrary compiler option configuration in $GOAMD64, and all the complexity that comes with it.

So what are the overheads? Here is an estimate.

$ ls -l $(which go)
-rwxr-xr-x  1 rsc  primarygroup  15056484 Nov 15 15:10 /Users/rsc/go/bin/go
$ go tool objdump $(which go) >go.dump
$ grep -c '^TEXT' go.dump
10362
$ cat go.dump | awk '$2~/^0x/ {print $4, length($3)/2}' | sort | uniq -c | egrep 'CALL| J|RET' >jumps
$ cat jumps | awk '{n=$3; if($2 ~ /^J[^M]/) n += 3; total += $1*(n/16)*(n+1)/2} END{print total}'
251848
$ 

This go command has 10,362 functions, and the padding required for instructions crossing or ending at a 16-byte boundary should average out to 251,848 extra bytes. The awk adds 3 to conditional jumps to simulate fusing of a preceding register-register CMP instruction.
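Transliterated into Go, the per-jump expectation that the awk computes looks like this (same simplifying assumption: end positions uniformly distributed mod 16):

package main

import "fmt"

// expectedPad mirrors the awk program above: an n-byte jump (or fused
// CMP+Jcc pair) placed uniformly at random ends at or crosses a 16-byte
// boundary with probability n/16, and when it does, the pad required
// ranges from 1 to n bytes, averaging (n+1)/2.
func expectedPad(n int) float64 {
	return float64(n) / 16 * (float64(n) + 1) / 2
}

func main() {
	fmt.Println(expectedPad(2)) // a lone 2-byte Jcc: 0.1875 pad bytes on average
	fmt.Println(expectedPad(5)) // a 2-byte Jcc fused with a 3-byte CMP: 0.9375
}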

Changing function alignment to 32 bytes would halve the padding added (saving 125,924 bytes) but add 16 more bytes on average to each of the functions (adding 165,792 bytes). So changing function alignment does not seem to be worthwhile.

Same for a smaller binary:

$ ls -l $(which gofmt)
-rwxr-xr-x  1 rsc  primarygroup  3499584 Nov 15 15:09 /Users/rsc/go/bin/gofmt
$ go tool objdump $(which gofmt) >gofmt.dump
$ grep -c '^TEXT' gofmt.dump
2956
$ cat gofmt.dump | awk '$2~/^0x/ {print $4, length($3)/2}' | sort | uniq -c | egrep 'CALL| J|RET' >jumps
$ cat jumps | awk '{n=$3; if($2 ~ /^J[^M]/) n += 3; total += $1*(n/16)*(n+1)/2} END{print total}'
58636.8
$ 

Changing alignment to 32 would save 29,318.4 bytes but add 47,296 bytes.

Overall, the added bytes are 1.67% in both the go command and gofmt. This is not nothing, but it seems like a small price to pay for correct execution, and if it makes things faster on some systems, even better.

My tentative conclusion then would be that we should just turn this on by default and not have an option. Thoughts?

@rsc rsc added the NeedsDecision label Nov 27, 2019
@rsc rsc added this to the Go1.14 milestone Nov 27, 2019
@markdryan markdryan commented Nov 28, 2019

@rsc thank you for entering the bug and summarising the issue.

The suggested workaround for the workaround is to insert padding so that fused branch sequences never end at or cross a 64-byte boundary. This means the whole CMP+Jcc, not just Jcc.

The assembler patches work by ensuring that neither standalone nor macro-fused jumps end on or cross 32-byte boundaries, not 64.

What is the right amount of padding to insert by default?

This is a difficult question to answer, as there isn't a single value that is optimal across all architectures. The white paper notes that the current default of 5 may not be optimal for some Atom processors, which are not affected by the erratum; these processors may take longer to decode instructions that have more than 3 or 4 prefixes.

On the other hand, a default value of 3 is sub-optimal for processors affected by the erratum, as it means there is less prefix space available for padding. If the instructions that precede a jump do not have sufficient prefix space available to pad that jump, the patch pads the jump with NOPs instead, which are less efficient.
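To illustrate the prefix-space idea (bytes hand-assembled for this illustration, so treat the encodings as a sketch rather than the patch's actual output): padding can come either from redundant segment-override prefixes on the instruction ahead of the jump, which add bytes without adding instructions, or from a separate multi-byte NOP, which decodes to a uop of its own.

package main

// Two ways to add 3 bytes of padding ahead of a CMP RAX, RBX / JNE pair
// (Intel syntax, x86-64 encodings).
var (
	// Redundant 0x3E (DS segment-override) prefixes on the CMP: still
	// two instructions in total, so no extra uops are decoded.
	prefixPadded = []byte{0x3E, 0x3E, 0x3E, 0x48, 0x39, 0xD8, 0x75, 0x10}

	// A 3-byte NOP (0F 1F 00) in front of the unmodified pair: three
	// instructions, so the NOP occupies its own slot in the decoded stream.
	nopPadded = []byte{0x0F, 0x1F, 0x00, 0x48, 0x39, 0xD8, 0x75, 0x10}
)

func main() {}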

Changing function alignment to 32 bytes would halve the padding added (saving 125,924 bytes) but add 16 more bytes on average to each of the functions (adding 165,792 bytes). So changing function alignment does not seem to be worthwhile.

I modified the patch locally so that it retains the existing 16-byte function alignment but pads jumps so that they do not end on or cross 16-byte boundaries. I found that this actually increases binary size slightly over the original patch, which uses 32-byte alignment. For the go binary I see sizes of 15551238 (16-byte alignment) vs 15436550 (32-byte alignment). For go1.test I see 11800358 (16-byte alignment) vs 11722534 (32-byte alignment).

What is more worrying, however, is that I see many more NOPs in the code stream when using 16-byte alignment. For go1.test, 16-byte alignment yields 25001 NOPs (of varying size) whereas 32-byte alignment yields only 12375. This stands to reason: the patch cannot always use prefixes to pad jumps, and when prefixes can't be used, it falls back to NOPs. If we reduce the alignment granularity to 16 bytes we'll increase the number of jumps that need to be padded and, most likely, the number of NOPs that need to be inserted. This is likely to hurt performance.

My tentative conclusion then would be that we should just turn this on by default and not have an option. Thoughts?

I think the concerns about enabling the patch by default are:

  1. Increase in binary size
  2. Increase in build times
  3. There does not seem to be an optimal default value for maximum number of prefixes across all architectures.
  4. The patch may impact the performance of those architectures which are not affected by this JCC erratum.
  5. The patch cannot be applied in some scenarios where application behavior is dependent on exact code size. In other words, the inserted padding (prefix, nop) may break the assumption of code size that the programmer has made. Such assumptions have been observed in the compilation of the Linux kernel. Building Linux is not an issue for Go of course, but I can imagine some cases where the patch might break existing assembly code, e.g., JMP 8(PC).

On the other hand, another issue with not having the patch enabled by default is that to really take advantage of it, you would need to build the Go toolchain yourself; otherwise the standard library functions that your program links against would presumably not be compiled with the mitigation.

@markdryan markdryan commented Nov 28, 2019

For reference, here are links to the llvm and binutils patches.

@knweiss knweiss commented Nov 29, 2019

In general, we try to do the right thing for developers so that they don't have to keep track of every last CPU erratum. That seems like it would suggest we should do the padding automatically. Otherwise Go programs on this very large list of processors have the possibility of behaving “unpredictably.”

@rsc Correct me if I'm wrong but my understanding is that there are two ways to fix the unpredictable behavior caused by this CPU erratum:

  1. Install the latest Intel CPU microcode updates. This already fixes the unpredictable behavior issue at the expense of a performance penalty for the affected CPUs. Binary sizes will not change.
  2. Avoid the problematic jump instructions by adding (prefix) padding. This also prevents the unpredictable behavior reliably (i.e. the latest microcode version is not required), at the expense of increased binary size on all CPUs, affected or not. Additionally, this will improve performance for affected CPUs with the latest microcode version (because the faster decoded-icache mechanism will not be disabled) but cost performance on all other current and future CPUs (e.g. because of the larger code footprint in instruction caches).

I.e. the padding is actually not required IFF Go programs can assume they are running under the latest microcode versions. In that case the padding is "just" a performance optimization for the affected CPUs with the latest microcode fixes.

@martisch martisch commented Dec 2, 2019

@knweiss your text also aligns with my understanding: if the affected CPU has updated microcode, then padding is only a performance improvement for that affected, microcode-updated CPU, while adding the padding costs performance (decode bandwidth, icache usage) on all CPUs that are unaffected or unpatched.

This looks to me like a performance tradeoff decision between affected and unaffected CPUs (especially as, e.g., many operating systems load updated microcode automatically even if the BIOS is not updated).

Is there data to understand how much this helps affected CPUs vs how much it regresses performance on unaffected CPUs?

@randall77 randall77 commented Dec 2, 2019

the patch pads the jump with NOPs instead, which are less efficient.

Where does this effect come from? We're wasting the same icache space in either case (padding with prefixes or with no-ops). Is it the decoded instruction cache? In any case, we should measure to see how much worse NOPs are than prefixes.

Increase in build times

A few percent here is not a big deal.

There does not seem to be an optimal default value for maximum number of prefixes across all architectures.

If the performance characteristics of multiple prefixes vary a lot by processor, I'd rather use the safe number (3?) and do larger padding with NOPs. That way performance is predictable, and we don't need subarch variants.

The patch may impact the performance of those architectures which are not affected by this JCC erratum.

This is my sticking point. This patch will only become less useful over time (assuming Intel has stopped selling the bad chips), and it's not clear even now whether it is a net win. Answers to @knweiss's and @martisch's comments would help here.

The patch cannot be applied in some scenarios where application behavior is dependent on exact code size. In other words, the inserted padding (prefix, nop) may break the assumption of code size that the programmer has made. Such assumptions have been observed in the compilation of the Linux kernel. Building Linux is not an issue for Go of course, but I can imagine some cases where the patch might break existing assembly code, e.g., JMP 8(PC).

I think we need to handle jumps like this correctly. Compute the target of the jump without padding, and adjust the jump if padding is inserted between the jump and target.
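A sketch of what that might look like (invented, simplified types, not the actual cmd/internal/obj representation): keep branch targets as pointers to instructions, make padding decisions, and only then derive byte displacements from final positions.

package main

import "fmt"

// inst is a stand-in for an assembled instruction. A branch points at
// its target instruction, so padding inserted between a branch and its
// target cannot silently break the jump.
type inst struct {
	size   int   // encoded size in bytes (grows if padding is added)
	target *inst // non-nil for branches; resolved before padding runs
	offset int   // byte offset within the function, assigned at layout
}

// layout assigns final offsets, then computes each branch's displacement
// relative to the end of the branch instruction.
func layout(prog []*inst) map[*inst]int {
	off := 0
	for _, p := range prog {
		p.offset = off
		off += p.size
	}
	disp := make(map[*inst]int)
	for _, p := range prog {
		if p.target != nil {
			disp[p] = p.target.offset - (p.offset + p.size)
		}
	}
	return disp
}

func main() {
	target := &inst{size: 1}
	jump := &inst{size: 2, target: target}
	pad := &inst{size: 3} // NOP inserted between jump and target
	fmt.Println(layout([]*inst{jump, pad, target})[jump]) // 3, not 0
}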

On the other hand, another issue with not having the patch enabled by default is that to really take advantage of it, you would need to build the Go toolchain yourself; otherwise the standard library functions that your program links against would presumably not be compiled with the mitigation.

I don't think this is correct. As long as the go tool uses the option as part of the build key (see discussion of using GOAMD64 instead of GO_X86_PADJUMP), it will rebuild the part of the standard library that your application uses with the same options as the application itself.

@rsc rsc commented Dec 2, 2019

Thanks for the good discussion so far. Two small things:

Note that JMP 8(PC) does not mean 8 bytes in the text segment. It means 8 instructions forward in the assembly file. Those targets are resolved fairly early in the assembler. You'd probably have to go out of your way to break that when inserting padding into the encoding of individual instructions. (But please don't, of course.)
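(A hypothetical Go assembly snippet to make that concrete; the 2(PC) target counts instructions rather than bytes, so padding that changes instruction sizes does not move it:)

#include "textflag.h"

// func skipOne() int64
TEXT ·skipOne(SB), NOSPLIT, $0-8
	MOVQ $1, AX
	JMP  2(PC)         // forward two instructions, counted in the assembly source
	MOVQ $2, AX        // skipped, whatever its encoded size ends up being
	MOVQ AX, ret+0(FP) // returns 1
	RET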

Keith is also correct about a GOAMD64 change causing a complete rebuild when the cache only has objects with different GOAMD64 settings, just as you'd hope.

@knweiss knweiss commented Dec 2, 2019

@randall77 Regarding NOPs:

“I don't have any numbers myself. I was only involved in some of the code review internally. My understanding is that NOP instructions would place extra nop uops into the DSB (the decoded uop buffer) and that limits the performance that can be recovered. By using redundant prefixes no extra uops are generated and more performance is recovered.” Source

@martisch martisch commented Dec 3, 2019

Some more thoughts on the topic:

Go 1.14:
This issue is currently marked for Go 1.14, but I do not think this change should be made at this point in the cycle: the workaround (a microcode update) is needed anyway for other programs to work correctly on the affected CPUs, and previous Go versions, unless the change is backported, will show the same unpredictable behavior on affected CPUs without updated microcode.

Padding and binary size:
Apart from function alignment, gc has AFAIK not really started to exploit the potential of padding for performance improvements. If, e.g., a 2% binary-size budget is to be spent, it may be better to apply it towards generic loop alignment to speed up execution on a much larger selection of amd64 CPUs.

Architecture specific padding:
I think we need to discuss some general thresholds/criteria for accepting tradeoffs of binary size vs performance. If this change is accepted as the default, would we also consider adding NOPs to pad instruction streams to increase performance on Atom architectures?

Adding GOAMD64 options:
We have largely avoided adding tuning flags for amd64 architectures. I can see other options potentially having a much larger effect on amd64 optimization than padding (e.g., assuming SSE4 or AVX as a baseline, together with additional compiler changes to use those instructions much more broadly). The difference from padding is of course that such options would not allow the resulting binary to run on old hardware, although depending on which instructions are assumed available, very few CPUs still in common use might actually be excluded. Further options that might have better effects without excluding older CPUs are instruction scheduling and the cost model for operations.

Maintenance of new compiler options:
The more options are added, the more buildbots, benchmarks, and tests we will need in order to detect that compiler changes do not regress in correctness or performance. Each option is also one more way Go programs can differ when analyzing bug reports.

Pseudo-assembler doing more magic:
A problem that may arise: if the Go assembler starts inserting NOPs into assembly that developers have written, it might interfere with careful alignment they intended to achieve by choosing a specific sequence of instructions.

@markdryan markdryan commented Dec 3, 2019

I've updated the patch in gerrit to improve the prefix counting code and to add a unit test for the prefix counting code.

@markdryan markdryan commented Dec 3, 2019

I.e. the padding is actually not required IFF Go programs can assume they are running under the latest microcode versions. In that case the padding is "just" a performance optimization for the affected CPUs with the latest microcode fixes.

@knweiss This is essentially correct. The Go patch is designed to compensate for the performance effects of the microcode update on Go programs. See section 2.4 of the white paper for more details.

@markdryan markdryan commented Dec 3, 2019

Where does this effect come from? We're wasting the same icache space in either case (padding with prefixes or with no-ops). Is it the decoded instruction cache? In any case, we should measure to see how much NOPs are worse than prefixes.

@randall77 There's some additional information about the use of NOPs on Intel hardware in section "3.5.1.9 Using NOPs" of the Intel® 64 and IA-32 Architectures Optimization Reference Manual. I will try to gather some data on prefixes vs NOPs in the context of this patch.

I don't think this is correct. As long as the go tool uses the option as part of the build key (see discussion of using GOAMD64 instead of GO_X86_PADJUMP), it will rebuild the part of the standard library that your application uses with the same options as the application itself.

Ah, this is good news.

@martisch martisch commented Dec 4, 2019

Phoronix has some benchmarks for GCC: https://www.phoronix.com/scan.php?page=article&item=intel-jcc-microcode

It seems like adding padding can also further lower performance in benchmarks for affected CPUs with new microcode.
