cmd/internal/obj/x86: pad jumps to avoid Intel erratum #35881

Open
rsc opened this issue Nov 27, 2019 · 28 comments

@rsc rsc commented Nov 27, 2019

Intel erratum SKX102 “Processor May Behave Unpredictably Under Complex Sequence of Conditions Which Involve Branches That Cross 64-Byte Boundaries” applies to:

  • Intel® Celeron® Processor 4000 Series
  • Intel® Celeron® Processor G Series
  • 10th Generation Intel® Core™ i5 Processors
  • 10th Generation Intel® Core™ i7 Processors
  • 6th Generation Intel® Core™ i3 Processors
  • 6th Generation Intel® Core™ i5 Processors
  • 6th Generation Intel® Core™ i7 Processors
  • 6th Generation Intel® Core™ m Processors
  • 7th Generation Intel® Core™ i3 Processors
  • 7th Generation Intel® Core™ i5 Processors
  • 7th Generation Intel® Core™ i7 Processors
  • 7th Generation Intel® Core™ m Processors
  • 8th Generation Intel® Core™ i3 Processors
  • 8th Generation Intel® Core™ i5 Processors
  • 8th Generation Intel® Core™ i7 Processors
  • 8th Generation Intel® Core™ m Processors
  • 9th Generation Intel® Core™ i9 Processors
  • Intel® Core™ X-series Processors
  • Intel® Pentium® Gold Processor Series
  • Intel® Pentium® Processor G Series
  • Intel® Xeon® Processor E3 v5 Family
  • Intel® Xeon® Processor E3 v6 Family
  • 2nd Generation Intel® Xeon® Scalable Processors
  • Intel® Xeon® E Processor
  • Intel® Xeon® Scalable Processors
  • Intel® Xeon® W Processor

There is a microcode fix that can be applied by the BIOS to avoid the incorrect execution. It stops any jump (jump, jcc, call, ret, direct, indirect) from being cached in the decoded icache when the instruction ends at or crosses a 32-byte boundary. Intel says:

Intel has observed performance effects associated with the workaround [microcode fix] ranging from 0-4% on many industry-standard benchmarks. In subcomponents of these benchmarks, Intel has observed outliers higher than the 0-4% range. Other workloads not observed by Intel may behave differently.

The suggested workaround for the workaround is to insert padding so that fused branch sequences never end at or cross a 64-byte boundary. This means the whole CMP+Jcc, not just Jcc.

CL 206837 adds a new environment variable to set the padding policy. The original CL used $GO_X86_PADJUMP but the discussion has moved on to using $GOAMD64, which would avoid breaking the build cache.

There are really two questions here:

  • What is the right amount of padding to insert by default?
  • Given that default, what additional control do developers need over the padding?

In general, we try to do the right thing for developers so that they don't have to keep track of every last CPU erratum. That seems like it would suggest we should do the padding automatically. Otherwise Go programs on this very large list of processors have the possibility of behaving “unpredictably.”

If the overheads involved are small enough and we are 100% confident in the padding code, we could stop there and just leave it on unconditionally. It seems like that's what we should do rather than open the door to arbitrary compiler option configuration in $GOAMD64, and all the complexity that comes with it.

So what are the overheads? Here is an estimate.

$ ls -l $(which go)
-rwxr-xr-x  1 rsc  primarygroup  15056484 Nov 15 15:10 /Users/rsc/go/bin/go
$ go tool objdump $(which go) >go.dump
$ grep -c '^TEXT' go.dump
10362
$ cat go.dump | awk '$2~/^0x/ {print $4, length($3)/2}' | sort | uniq -c | egrep 'CALL| J|RET' >jumps
$ cat jumps | awk '{n=$3; if($2 ~ /^J[^M]/) n += 3; total += $1*(n/16)*(n+1)/2} END{print total}'
251848
$ 

This go command has 10,362 functions, and the padding required for instructions crossing or ending at a 16-byte boundary should average out to 251,848 extra bytes. The awk adds 3 to conditional jumps to simulate fusing of a preceding register-register CMP instruction.
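
The (n/16)*(n+1)/2 term in the awk script follows from an averaging argument: an n-byte instruction placed at a uniformly random offset in a 16-byte window crosses or ends at the boundary for n of the 16 offsets, and moving it to the next boundary then costs between 1 and n bytes, (n+1)/2 on average. The Go sketch below checks that closed form against brute-force enumeration. It assumes uniformly distributed offsets and n no larger than the window; the function names are illustrative and not part of the CL.

```go
package main

import "fmt"

// expectedPadding enumerates every start offset in an align-byte window and
// averages the padding needed so that an n-byte instruction neither crosses
// nor ends at the next align-byte boundary.
func expectedPadding(n, align int) float64 {
	total := 0
	for off := 0; off < align; off++ {
		if off+n >= align { // crosses or ends at the boundary
			total += align - off // bytes needed to start at the boundary
		}
	}
	return float64(total) / float64(align)
}

// closedForm is the expression used in the awk script: (n/align)*(n+1)/2.
func closedForm(n, align int) float64 {
	return float64(n) / float64(align) * float64(n+1) / 2
}

func main() {
	for _, n := range []int{1, 2, 5, 6} {
		fmt.Printf("n=%d brute=%.4f closed=%.4f\n",
			n, expectedPadding(n, 16), closedForm(n, 16))
	}
}
```

The two columns agree for every n up to the window size, which is the range the estimate relies on.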

Changing function alignment to 32 bytes would halve the padding added (saving 125,924 bytes) but add 16 more bytes on average to each of the functions (adding 165,792 bytes). So changing function alignment does not seem to be worthwhile.

Same for a smaller binary:

$ ls -l $(which gofmt)
-rwxr-xr-x  1 rsc  primarygroup  3499584 Nov 15 15:09 /Users/rsc/go/bin/gofmt
$ go tool objdump $(which gofmt) >gofmt.dump
$ grep -c '^TEXT' gofmt.dump
2956
$ cat gofmt.dump | awk '$2~/^0x/ {print $4, length($3)/2}' | sort | uniq -c | egrep 'CALL| J|RET' >jumps
$ cat jumps | awk '{n=$3; if($2 ~ /^J[^M]/) n += 3; total += $1*(n/16)*(n+1)/2} END{print total}'
58636.8
$ 

Changing alignment to 32 would save 29,318.4 bytes but add 47,296 bytes.

Overall, the added bytes are 1.67% in both the go command and gofmt. This is not nothing, but it seems like a small price to pay for correct execution, and if it makes things faster on some systems, even better.
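
For anyone who wants to redo the arithmetic, here is a small Go sketch that recomputes the overhead percentage and the 32-byte alignment trade-off from the figures quoted above. It takes the text's assumptions at face value (padding halves, and each function grows by 16 bytes on average); the type and function names are made up for illustration.

```go
package main

import "fmt"

// estimate holds the figures quoted in the text for one binary.
type estimate struct {
	name    string
	size    float64 // file size in bytes, from ls -l
	funcs   float64 // number of TEXT symbols in the objdump output
	padding float64 // estimated padding bytes at 16-byte granularity
}

// overheadPercent returns the estimated padding as a percentage of file size.
func overheadPercent(e estimate) float64 {
	return 100 * e.padding / e.size
}

// alignTo32 applies the text's assumptions for 32-byte function alignment:
// the jump padding halves, and every function grows by 16 bytes on average.
func alignTo32(e estimate) (saved, added float64) {
	return e.padding / 2, e.funcs * 16
}

func main() {
	for _, e := range []estimate{
		{"go", 15056484, 10362, 251848},
		{"gofmt", 3499584, 2956, 58636.8},
	} {
		saved, added := alignTo32(e)
		fmt.Printf("%-5s overhead %.2f%%  saved %.1f  added %.0f\n",
			e.name, overheadPercent(e), saved, added)
	}
}
```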

My tentative conclusion then would be that we should just turn this on by default and not have an option. Thoughts?

@rsc rsc added the NeedsDecision label Nov 27, 2019
@rsc rsc added this to the Go1.14 milestone Nov 27, 2019

@markdryan markdryan commented Nov 28, 2019

@rsc thank you for entering the bug and summarising the issue.

The suggested workaround for the workaround is to insert padding so that fused branch sequences never end at or cross a 64-byte boundary. This means the whole CMP+Jcc, not just Jcc.

The assembler patches work by ensuring that neither standalone nor macro-fused jumps end on or cross 32 byte boundaries, not 64.
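
To make the condition concrete, here is a minimal Go sketch of the check "ends on or crosses a 32-byte boundary", applied to either a standalone jump or a whole macro-fused pair. The helper names are hypothetical and this is not the code from the CL; it only illustrates the rule as stated.

```go
package main

import "fmt"

// touchesBoundary reports whether a sequence occupying bytes
// [start, start+size) crosses a boundary-aligned address or ends exactly on one.
func touchesBoundary(start, size, boundary uint64) bool {
	end := start + size // address of the first byte after the sequence
	return start/boundary != end/boundary
}

// padNeeded returns how many bytes must be inserted before the sequence so
// that it starts on the next boundary, or 0 if it is already fine.
func padNeeded(start, size, boundary uint64) uint64 {
	if !touchesBoundary(start, size, boundary) {
		return 0
	}
	return boundary - start%boundary
}

func main() {
	// A 3-byte CMP at offset 30 followed by a 2-byte Jcc at offset 33.
	fmt.Println(touchesBoundary(33, 2, 32)) // Jcc alone
	fmt.Println(touchesBoundary(30, 5, 32)) // fused CMP+Jcc
	fmt.Println(padNeeded(30, 5, 32))
}
```

In this example the Jcc on its own is fine, but the fused CMP+Jcc pair crosses the boundary, which is why the check has to cover the whole fused sequence.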

What is the right amount of padding to insert by default?

This is a difficult question to answer as there isn't a single value that is optimal across all architectures. The white paper notes that the current default of 5 may not be optimal for some Atom processors, which are not affected by the erratum. These processors may take longer to decode instructions that have more than 3 or 4 prefixes.

On the other hand, a default value of 3 is sub-optimal for processors affected by the Erratum as it means that there is less prefix space available for padding. If the instructions that precede a jump do not have sufficient prefix space available to pad that jump, the patch pads the jump with NOPs instead, which are less efficient.
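
A rough Go sketch of that trade-off follows. It is a simplified model of the behavior described here, not the patch itself: planPadding and its parameters are hypothetical, and the all-or-nothing NOP fallback is an assumption based on the description above.

```go
package main

import "fmt"

// planPadding decides how to supply `need` bytes of padding before a jump.
// existing[i] is the number of prefixes instruction i already carries and
// maxPrefixes is the per-instruction cap. It returns the extra prefixes to
// add to each preceding instruction, or a NOP byte count when the available
// prefix space is insufficient.
func planPadding(existing []int, maxPrefixes, need int) (extra []int, nopBytes int) {
	extra = make([]int, len(existing))
	slack := 0
	for _, e := range existing {
		if e < maxPrefixes {
			slack += maxPrefixes - e
		}
	}
	if slack < need {
		return extra, need // not enough prefix space: fall back to NOPs
	}
	// Fill from the instruction nearest the jump, working backwards.
	for i := len(existing) - 1; i >= 0 && need > 0; i-- {
		room := maxPrefixes - existing[i]
		if room <= 0 {
			continue
		}
		if room > need {
			room = need
		}
		extra[i] = room
		need -= room
	}
	return extra, 0
}

func main() {
	existing := []int{2, 3} // prefixes already on the two preceding instructions
	for _, limit := range []int{3, 5} {
		extra, nops := planPadding(existing, limit, 4)
		fmt.Printf("limit=%d extra=%v nops=%d\n", limit, extra, nops)
	}
}
```

With a limit of 3 the two instructions have only one spare prefix slot between them, so the 4 bytes fall back to NOPs; with a limit of 5 the same padding fits entirely in prefixes.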

Changing function alignment to 32 bytes would halve the padding added (saving 125,924 bytes) but add 16 more bytes on average to each of the functions (adding 165,792 bytes). So changing function alignment does not seem to be worthwhile.

I modified the patch locally so that it retains the existing 16 byte function alignment but pads jumps so that they do not end on or cross 16 byte boundaries. I found that this actually increases binary size slightly over the original patch which uses 32 byte alignment. So for the Go binary I see binary sizes of 15551238 (16 byte alignment) vs 15436550 (32 byte alignment). For go1.test I see 11800358 (16 byte alignment) vs 11722534 (32 byte alignment).

What is more worrying, however, is that I see many more NOPs in the code stream when using 16 byte alignment. So for go1.test, padding with 16 bytes yields 25001 NOPs (of varying size) whereas padding with 32 bytes yields only 12375. This stands to reason really. The patch cannot always use prefixes to pad jumps and when prefixes can't be used, it falls back to NOPs. If we reduce function alignment to 16 we'll increase the number of jumps that need to be padded and, most likely, the number of NOPs that need to be inserted. This is likely to hurt performance.

My tentative conclusion then would be that we should just turn this on by default and not have an option. Thoughts?

I think the concerns about enabling the patch by default are:

  1. Increase in binary size
  2. Increase in build times
  3. There does not seem to be an optimal default value for maximum number of prefixes across all architectures.
  4. The patch may impact the performance of those architectures which are not affected by this JCC erratum.
  5. The patch cannot be applied in some scenarios where application behavior is dependent on exact code size. In other words, the inserted padding (prefix, nop) may break the assumption of code size that the programmer has made. Such assumptions have been observed in the compilation of the Linux kernel. Building Linux is not an issue for Go of course, but I can imagine some cases where the patch might break existing assembly code, e.g., JMP 8(PC).

On the other hand, another issue with not having the patch enabled by default, is that to really take advantage of it, you would need to build the go tool chain yourself, otherwise the standard library functions that your program links against would presumably not be compiled with the mitigation.

@markdryan markdryan commented Nov 28, 2019

For reference, here are links to the llvm and binutils patches.

@knweiss knweiss commented Nov 29, 2019

In general, we try to do the right thing for developers so that they don't have to keep track of every last CPU erratum. That seems like it would suggest we should do the padding automatically. Otherwise Go programs on this very large list of processors have the possibility of behaving “unpredictably.”

@rsc Correct me if I'm wrong but my understanding is that there are two ways to fix the unpredictable behavior caused by this CPU erratum:

  1. Install the latest Intel CPU microcode updates. This already fixes the unpredictable behavior issue at the expense of a performance penalty for the affected CPUs. Binary sizes will not change.
  2. Avoid the problematic jump instructions by adding (prefix) padding. This also prevents the unpredictable behavior reliably (i.e. the latest microcode version is not required) at the expense of increased binary sizes on all CPUs - affected or not. Additionally, this will improve performance for affected CPUs with the latest microcode version (because the faster decoded-icache mechanism will not be disabled) but cost performance on all other current and future CPUs (e.g. because of larger code footprint in instruction caches). It will also increase binary size for everyone.

I.e. the padding is actually not required IFF Go programs can assume to run under the latest microcode versions. In this case the padding is "just" a performance optimization for the affected CPUs with the latest microcode fixes.

@martisch martisch commented Dec 2, 2019

@knweiss your text also aligns with my understanding: if the affected CPU has updated microcode, then padding is only a performance improvement for that affected, microcode-updated CPU, while the padding lowers performance (decode bandwidth, icache usage) for all other CPUs (unaffected or unpatched).

To me this looks like a performance tradeoff decision between affected and unaffected CPUs (many operating systems, for example, load updated microcode automatically even if the BIOS is not updated).

Is there data to understand how much this helps affected CPUs vs. how much of a performance regression it causes on unaffected CPUs?

@randall77 randall77 commented Dec 2, 2019

the patch pads the jump with NOPs instead, which are less efficient.

Where does this effect come from? We're wasting the same icache space in either case (padding with prefixes or with no-ops). Is it the decoded instruction cache? In any case, we should measure to see how much NOPs are worse than prefixes.

Increase in build times

A few percent here is not a big deal.

There does not seem to be an optimal default value for maximum number of prefixes across all architectures.

If the performance characteristics of multiple prefixes vary a lot by processor, I'd rather use the safe number (3?) and do larger padding with NOPs. That way performance is predictable, and we don't need subarch variants.

The patch may impact the performance of those architectures which are not affected by this JCC erratum.

This is my sticking point. This patch will only be less useful over time (assuming Intel has stopped selling the bad chips), and it's not clear even now whether it is a net win. Answers to @knweiss and @martisch 's comments would help here.

The patch cannot be applied in some scenarios where application behavior is dependent on exact code size. In other words, the inserted padding (prefix, nop) may break the assumption of code size that the programmer has made. Such assumptions have been observed in the compilation of the Linux kernel. Building Linux is not an issue for Go of course, but I can imagine some cases where the patch might break existing assembly code, e.g., JMP 8(PC).

I think we need to handle jumps like this correctly. Compute the target of the jump without padding, and adjust the jump if padding is inserted between the jump and target.

On the other hand, another issue with not having the patch enabled by default, is that to really take advantage of it, you would need to build the go tool chain yourself, otherwise the standard library functions that your program links against would presumably not be compiled with the mitigation.

I don't think this is correct. As long as the go tool uses the option as part of the build key (see discussion of using GOAMD64 instead of GO_X86_PADJUMP), it will rebuild the part of the standard library that your application uses with the same options as the application itself.

@rsc rsc commented Dec 2, 2019

Thanks for the good discussion so far. Two small things:

Note that JMP 8(PC) does not mean 8 bytes in the text segment. It means 8 instructions forward in the assembly file. Those targets are resolved fairly early in the assembler. You'd probably have to go out of your way to break that when inserting padding into the encoding of individual instructions. (But please don't of course.)
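
A toy Go illustration of why instruction-relative targets survive padding: the target is fixed as an instruction index first, and the byte displacement is only computed afterwards from the final encoded sizes. The helpers below are hypothetical, handle forward jumps only, and are not the assembler's actual data structures.

```go
package main

import "fmt"

// targetIndex resolves an n(PC)-style operand: the target is n instructions
// after the jump, regardless of how those instructions are encoded.
func targetIndex(jump, n int) int {
	return jump + n
}

// displacement returns the byte distance from the end of the jump to the
// start of its target, given the final encoded size of every instruction
// (including any padding folded into it). Forward jumps only.
func displacement(sizes []int, jump, n int) int {
	d := 0
	for i := jump + 1; i < targetIndex(jump, n); i++ {
		d += sizes[i]
	}
	return d
}

func main() {
	sizes := []int{2, 2, 2, 2, 2, 2, 2, 2, 2, 2}
	fmt.Println(targetIndex(0, 8), displacement(sizes, 0, 8))

	sizes[4] += 3 // three bytes of padding added to instruction 4
	fmt.Println(targetIndex(0, 8), displacement(sizes, 0, 8))
}
```

The target index is the same before and after padding; only the byte displacement changes, and it is derived from the final layout.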

Keith is also correct about a GOAMD64 change causing a complete rebuild when the cache only has objects with different GOAMD64 settings, just as you'd hope.
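
The idea can be sketched in a few lines of Go: if the setting is hashed into the cache key, objects built under a different setting simply never match. This is a hypothetical illustration, not the go command's real cache code; buildKey, its inputs, and the "padjumps" value are invented for the example.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// buildKey is a hypothetical stand-in for a build-cache key. Every input that
// can change the generated code, including the GOAMD64 setting, is hashed in.
func buildKey(pkg, source, goamd64 string) [sha256.Size]byte {
	h := sha256.New()
	fmt.Fprintf(h, "pkg %q\nsource %q\nGOAMD64 %q\n", pkg, source, goamd64)
	var key [sha256.Size]byte
	copy(key[:], h.Sum(nil))
	return key
}

func main() {
	plain := buildKey("fmt", "package fmt", "")
	padded := buildKey("fmt", "package fmt", "padjumps") // hypothetical value
	fmt.Println(plain == padded)                         // differing keys force a rebuild
}
```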

@knweiss knweiss commented Dec 2, 2019

@randall77 Regarding NOPs:

“I don't have any numbers myself. I was only involved in some of the code review internally. My understanding is that NOP instructions would place extra nop uops into the DSB (the decoded uop buffer) and that limits the performance that can be recovered. By using redundant prefixes no extra uops are generated and more performance is recovered.” Source

@martisch martisch commented Dec 3, 2019

Some more thoughts on the topic:

Go 1.14:
This issue is currently marked for Go 1.14. However, I do not think this change should be made at this point in the cycle: there is a workaround (the microcode update) that other programs need anyway to work correctly on the affected CPUs, and previous Go versions, unless the change is backported, will have the same unpredictable behavior on affected CPUs without microcode updates.

Padding and binary size:
Apart from function alignment, Go Gc has AFAIK not really started to exploit the potential of padding for performance improvements. If, e.g., a 2% binary size budget increase is to be used, it may be better to apply it towards generic loop alignment to speed up execution on a much larger selection of amd64 CPUs.

Architecture specific padding:
I think we need to discuss some general thresholds/criteria for accepting tradeoffs of binary size vs performance. If this change is accepted as the default, would we also consider adding NOPs to pad instruction streams to increase performance on Atom architectures?

Adding GOAMD64 options:
We have largely avoided adding tuning flags for amd64 architectures. I can see other options potentially having a much larger effect on amd64 optimization than padding (e.g. assuming SSE4 or AVX, together with additional compiler changes to use these instructions much more broadly). The difference from padding is of course that these options would not allow the resulting binary to run on old hardware. Depending on which instructions are assumed to be available, however, very few CPUs that are still in common use might be affected. Further options that might have better effects but do not exclude older CPUs may be instruction scheduling and cost of operations.

Maintenance of new compiler options:
The more options are added, the more buildbots, benchmarks and tests we will need in order to detect that compiler changes do not regress correctness or performance. This also adds one more way in which Go programs can differ when analyzing bug reports.

Pseudo assembler doing more magic:
If the Go assembler starts inserting NOPs into hand-written assembly as well, this might interfere with careful alignments that developers wanted to achieve by choosing a specific sequence of instructions.

@markdryan markdryan commented Dec 3, 2019

I've updated the patch in gerrit to improve the prefix counting code and to add a unit test for the prefix counting code.

@markdryan markdryan commented Dec 3, 2019

I.e. the padding is actually not required IFF Go programs can assume to run under the latest microcode versions. In this case the padding is "just" a performance optimization for the affected CPUs with the latest microcode fixes.

@knweiss This is essentially correct. The Go patch is designed to compensate for the performance effects of the microcode update on Go programs. See section 2.4 of the white paper for more details.

@markdryan markdryan commented Dec 3, 2019

Where does this effect come from? We're wasting the same icache space in either case (padding with prefixes or with no-ops). Is it the decoded instruction cache? In any case, we should measure to see how much NOPs are worse than prefixes.

@randall77 There's some additional information about the use of NOPs on Intel hardware in section "3.5.1.9 Using NOPs" of the Intel® 64 and IA-32 Architectures Optimization Reference Manual. I will try to gather some data on prefixes vs NOPs in the context of this patch.

I don't think this is correct. As long as the go tool uses the option as part of the build key (see discussion of using GOAMD64 instead of GO_X86_PADJUMP), it will rebuild the part of the standard library that your application uses with the same options as the application itself.

Ah, this is good news.

@martisch martisch commented Dec 4, 2019

Phoronix has some benchmarks for GCC https://www.phoronix.com/scan.php?page=article&item=intel-jcc-microcode

It seems like adding padding can also further lower performance in benchmarks for affected CPUs with new microcode.

@markdryan markdryan commented Dec 17, 2019

Here are some benchmark results that illustrate the effects of the microcode update and the software mitigation on Go programs.

Unless otherwise noted, all benchmarks were generated with either a local build of master at #99957b6 or with version 3 of the patch applied on top of #99957b6.

Test System

  • Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
  • Ubuntu 16.04
  • Turbo disabled
  • 64GiB RAM
  • Intel HyperThreading enabled
  • Microcode w/o microcode update: 0x200005e
  • Microcode with microcode update: 0x2000065
  • Testing performed on Nov 19th 2019, 3rd, 4th, 9th and 16th of Dec 2019
  • CGO_ENABLED=0 for go1.test

Effects of the Go Software Mitigation on Build Time and Binary Size

In our tests, the software mitigation can increase Go compiler build times by between 0% and 8%. Results for the Go compiler benchmark suite are presented below. These results slightly favor the “new time/op” column, generated by the patched Go compiler, as they were run on a machine on which the microcode update had been applied. The table shows the increase in time taken to compile a set of packages from the Go standard library.

name                      old time/op     new time/op     delta
Template                      284ms ± 6%      298ms ± 3%  +5.02%  (p=0.001 n=10+10)
Unicode                       135ms ± 4%      139ms ± 3%  +3.01%  (p=0.008 n=8+9)
GoTypes                       842ms ± 2%      887ms ± 2%  +5.37%  (p=0.000 n=10+10)
Compiler                      3.57s ± 0%      3.78s ± 0%  +5.95%  (p=0.000 n=9+10)
SSA                           11.4s ± 0%      12.3s ± 0%  +7.68%  (p=0.000 n=10+10)
Flate                         197ms ± 2%      206ms ± 3%  +4.83%  (p=0.000 n=8+9)
GoParser                      239ms ± 4%      246ms ± 4%  +2.80%  (p=0.019 n=9+9)
Reflect                       549ms ± 3%      570ms ± 2%  +3.81%  (p=0.000 n=10+10)
Tar                           255ms ± 7%      258ms ±11%    ~     (p=0.853 n=10+10)
XML                           330ms ± 2%      344ms ± 3%  +4.03%  (p=0.000 n=10+10)
LinkCompiler                  958ms ± 2%      936ms ± 1%  -2.27%  (p=0.000 n=9+9)
ExternalLinkCompiler          2.03s ± 1%      2.03s ± 2%    ~     (p=0.739 n=10+10)
LinkWithoutDebugCompiler      584ms ± 4%      574ms ± 1%    ~     (p=0.243 n=10+9)
StdCmd                        13.5s ± 1%      13.8s ± 1%  +2.22%  (p=0.000 n=9+10)
[Geo mean]                    800ms           824ms       +2.99%

name                      old user-ns/op  new user-ns/op  delta
Template                       485M ± 5%       476M ± 5%    ~     (p=0.190 n=10+10)
Unicode                        257M ±17%       253M ±18%    ~     (p=0.971 n=10+10)
GoTypes                       1.46G ± 2%      1.53G ± 8%    ~     (p=0.065 n=9+10)
Compiler                      6.33G ± 1%      6.37G ± 2%    ~     (p=0.065 n=9+10)
SSA                           18.4G ± 6%      19.1G ± 1%  +3.58%  (p=0.000 n=10+10)
Flate                          304M ± 6%       309M ± 7%    ~     (p=0.579 n=10+10)
GoParser                       359M ± 8%       353M ± 7%    ~     (p=0.222 n=9+9)
Reflect                        949M ± 2%       920M ± 4%  -3.12%  (p=0.002 n=10+10)
Tar                            398M ± 6%       387M ±15%    ~     (p=0.529 n=10+10)
XML                            532M ± 3%       520M ± 4%  -2.26%  (p=0.031 n=9+9)
LinkCompiler                  1.44G ±12%      1.33G ± 3%  -7.74%  (p=0.016 n=10+8)
ExternalLinkCompiler          2.32G ± 7%      2.25G ± 8%    ~     (p=0.190 n=10+10)
LinkWithoutDebugCompiler       853M ±15%       778M ±12%  -8.73%  (p=0.035 n=10+10)
[Geo mean]                    1.02G           1.00G       -1.74%

name                      old text-bytes  new text-bytes  delta
HelloSize                      800k ± 0%       818k ± 0%  +2.24%  (p=0.000 n=10+10)
CmdGoSize                     10.9M ± 0%      11.1M ± 0%  +1.82%  (p=0.000 n=10+10)
[Geo mean]                    2.95M           3.01M       +2.03%

name                      old data-bytes  new data-bytes  delta
HelloSize                     13.3k ± 0%      13.3k ± 0%    ~     (all equal)
CmdGoSize                      319k ± 0%       319k ± 0%    ~     (all equal)
[Geo mean]                    65.2k           65.2k       +0.00%

name                      old bss-bytes   new bss-bytes   delta
HelloSize                      114k ± 0%       114k ± 0%    ~     (all equal)
CmdGoSize                      138k ± 0%       138k ± 0%    ~     (all equal)
[Geo mean]                     125k            125k       +0.00%

name                      old exe-bytes   new exe-bytes   delta
HelloSize                     1.20M ± 0%      1.21M ± 0%  +1.37%  (p=0.000 n=10+10)
CmdGoSize                     15.2M ± 0%      15.4M ± 0%  +1.37%  (p=0.000 n=10+10)
[Geo mean]                    4.27M           4.33M       +1.37%

In our tests, Go binaries built with the software mitigation show size increases of between 1.4% and 2.0%. Here are some examples:

Binary Name  Size without patch (bytes)  Size with patch (bytes)  Delta
Go           15227654                    15436550                 +1.4%
go1.test     11558694                    11722534                 +1.4%
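
The Delta column can be reproduced from the two sizes with a one-line calculation; the Go sketch below does that for both binaries (the helper name is illustrative).

```go
package main

import "fmt"

// percentIncrease returns the relative growth from before to after, in percent.
func percentIncrease(before, after int64) float64 {
	return 100 * float64(after-before) / float64(before)
}

func main() {
	fmt.Printf("Go       %+.1f%%\n", percentIncrease(15227654, 15436550))
	fmt.Printf("go1.test %+.1f%%\n", percentIncrease(11558694, 11722534))
}
```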

Effects of the Go software mitigation on Go benchmarks

The effect of the Go software mitigation on the performance of generated binaries on machines without the microcode update applied is shown below.

name                      old time/op    new time/op    delta
BinaryTree17-20              2.40s ± 2%     2.41s ± 1%    ~     (p=0.146 n=10+8)
Fannkuch11-20                2.31s ± 0%     2.29s ± 0%  -0.87%  (p=0.000 n=10+9)
FmtFprintfEmpty-20          34.8ns ± 0%    34.7ns ± 0%  -0.53%  (p=0.000 n=10+9)
FmtFprintfString-20         58.8ns ± 0%    62.0ns ± 2%  +5.40%  (p=0.000 n=8+10)
FmtFprintfInt-20            65.6ns ± 1%    65.0ns ± 3%    ~     (p=0.159 n=9+10)
FmtFprintfIntInt-20          102ns ± 0%      98ns ± 1%  -3.02%  (p=0.000 n=10+10)
FmtFprintfPrefixedInt-20     114ns ± 1%     111ns ± 0%  -2.20%  (p=0.000 n=10+8)
FmtFprintfFloat-20           180ns ± 0%     182ns ± 1%  +1.06%  (p=0.000 n=9+10)
FmtManyArgs-20               431ns ± 0%     435ns ± 1%  +0.81%  (p=0.000 n=10+10)
GobDecode-20                4.52ms ± 1%    4.50ms ± 1%  -0.40%  (p=0.043 n=10+10)
GobEncode-20                3.65ms ± 2%    3.66ms ± 1%    ~     (p=0.393 n=10+10)
Gzip-20                      182ms ± 1%     184ms ± 1%  +1.04%  (p=0.000 n=9+10)
Gunzip-20                   28.5ms ± 0%    28.8ms ± 0%  +1.18%  (p=0.000 n=10+10)
HTTPClientServer-20          119µs ± 2%     118µs ± 1%    ~     (p=0.123 n=10+10)
JSONEncode-20               7.41ms ± 0%    7.33ms ± 1%  -1.06%  (p=0.000 n=8+10)
JSONDecode-20               32.6ms ± 1%    32.8ms ± 0%  +0.87%  (p=0.000 n=9+8)
Mandelbrot200-20            3.91ms ± 0%    3.91ms ± 0%    ~     (p=0.063 n=10+10)
GoParse-20                  2.70ms ± 0%    2.70ms ± 0%    ~     (p=0.114 n=8+9)
RegexpMatchEasy0_32-20      54.4ns ± 3%    56.6ns ± 0%  +3.99%  (p=0.000 n=8+9)
RegexpMatchEasy0_1K-20       151ns ± 0%     147ns ± 0%  -2.23%  (p=0.000 n=9+10)
RegexpMatchEasy1_32-20      52.3ns ± 5%    49.5ns ± 1%  -5.20%  (p=0.000 n=9+10)
RegexpMatchEasy1_1K-20       247ns ± 1%     249ns ± 1%  +0.75%  (p=0.003 n=8+10)
RegexpMatchMedium_32-20     5.04ns ± 0%    5.02ns ± 4%    ~     (p=0.188 n=6+10)
RegexpMatchMedium_1K-20     25.7µs ± 0%    26.1µs ± 2%  +1.50%  (p=0.001 n=8+10)
RegexpMatchHard_32-20       1.20µs ± 0%    1.22µs ± 2%  +2.09%  (p=0.000 n=10+10)
RegexpMatchHard_1K-20       36.3µs ± 0%    36.3µs ± 0%    ~     (p=0.956 n=10+10)
Revcomp-20                   342ms ± 1%     323ms ± 1%  -5.50%  (p=0.000 n=9+9)
Template-20                 43.2ms ± 1%    42.5ms ± 1%  -1.65%  (p=0.000 n=10+9)
TimeParse-20                 258ns ± 0%     264ns ± 0%  +2.52%  (p=0.000 n=8+9)
TimeFormat-20                247ns ± 0%     261ns ± 0%  +5.67%  (p=0.000 n=7+9)
[Geo mean]                  34.1µs         34.1µs       +0.07%

name                      old speed      new speed      delta
GobDecode-20               170MB/s ± 1%   171MB/s ± 1%  +0.40%  (p=0.042 n=10+10)
GobEncode-20               210MB/s ± 1%   210MB/s ± 1%    ~     (p=0.604 n=9+10)
Gzip-20                    107MB/s ± 1%   106MB/s ± 1%  -1.02%  (p=0.000 n=9+10)
Gunzip-20                  682MB/s ± 0%   674MB/s ± 0%  -1.17%  (p=0.000 n=10+10)
JSONEncode-20              262MB/s ± 0%   265MB/s ± 1%  +1.08%  (p=0.000 n=8+10)
JSONDecode-20             59.6MB/s ± 1%  59.1MB/s ± 0%  -0.86%  (p=0.000 n=9+8)
GoParse-20                21.5MB/s ± 0%  21.4MB/s ± 0%    ~     (p=0.097 n=8+9)
RegexpMatchEasy0_32-20     588MB/s ± 3%   565MB/s ± 0%  -3.86%  (p=0.000 n=8+9)
RegexpMatchEasy0_1K-20    6.79GB/s ± 0%  6.95GB/s ± 1%  +2.31%  (p=0.000 n=9+10)
RegexpMatchEasy1_32-20     613MB/s ± 5%   646MB/s ± 1%  +5.47%  (p=0.000 n=9+9)
RegexpMatchEasy1_1K-20    4.14GB/s ± 1%  4.11GB/s ± 1%  -0.69%  (p=0.002 n=8+10)
RegexpMatchMedium_32-20    198MB/s ± 0%   199MB/s ± 4%    ~     (p=0.185 n=7+10)
RegexpMatchMedium_1K-20   39.8MB/s ± 0%  39.3MB/s ± 2%  -1.45%  (p=0.001 n=8+10)
RegexpMatchHard_32-20     26.7MB/s ± 0%  26.2MB/s ± 2%  -2.04%  (p=0.000 n=10+10)
RegexpMatchHard_1K-20     28.2MB/s ± 0%  28.2MB/s ± 0%    ~     (p=0.982 n=10+10)
Revcomp-20                 743MB/s ± 1%   786MB/s ± 1%  +5.81%  (p=0.000 n=9+9)
Template-20               44.9MB/s ± 1%  45.7MB/s ± 1%  +1.78%  (p=0.000 n=10+10)
[Geo mean]                 203MB/s        204MB/s       +0.32%

Effects of the Microcode Update on Go benchmarks

Most benchmarks fall into the 0-4% range. However, there are outliers.

name                      old time/op    new time/op    delta
BinaryTree17-20              2.40s ± 2%     2.61s ± 2%   +8.62%  (p=0.000 n=10+9)
Fannkuch11-20                2.31s ± 0%     2.58s ± 0%  +11.61%  (p=0.000 n=10+10)
FmtFprintfEmpty-20          34.8ns ± 0%    35.4ns ± 2%   +1.58%  (p=0.000 n=10+10)
FmtFprintfString-20         58.8ns ± 0%    60.1ns ± 2%   +2.32%  (p=0.000 n=8+10)
FmtFprintfInt-20            65.6ns ± 1%    66.8ns ± 0%   +1.86%  (p=0.000 n=9+9)
FmtFprintfIntInt-20          102ns ± 0%     105ns ± 1%   +3.74%  (p=0.000 n=10+10)
FmtFprintfPrefixedInt-20     114ns ± 1%     113ns ± 1%     ~     (p=0.721 n=10+10)
FmtFprintfFloat-20           180ns ± 0%     179ns ± 0%   -0.56%  (p=0.000 n=9+9)
FmtManyArgs-20               431ns ± 0%     456ns ± 0%   +5.80%  (p=0.000 n=10+10)
GobDecode-20                4.52ms ± 1%    4.50ms ± 0%     ~     (p=0.063 n=10+10)
GobEncode-20                3.65ms ± 2%    3.66ms ± 1%     ~     (p=0.065 n=10+9)
Gzip-20                      182ms ± 1%     187ms ± 1%   +2.97%  (p=0.000 n=9+10)
Gunzip-20                   28.5ms ± 0%    30.0ms ± 0%   +5.42%  (p=0.000 n=10+10)
HTTPClientServer-20          119µs ± 2%     122µs ± 3%   +2.97%  (p=0.001 n=10+10)
JSONEncode-20               7.41ms ± 0%    7.85ms ± 2%   +5.85%  (p=0.000 n=8+10)
JSONDecode-20               32.6ms ± 1%    34.2ms ± 1%   +4.97%  (p=0.000 n=9+10)
Mandelbrot200-20            3.91ms ± 0%    3.91ms ± 0%   +0.12%  (p=0.000 n=10+10)
GoParse-20                  2.70ms ± 0%    2.71ms ± 1%   +0.38%  (p=0.015 n=8+9)
RegexpMatchEasy0_32-20      54.4ns ± 3%    55.0ns ± 0%   +1.08%  (p=0.012 n=8+10)
RegexpMatchEasy0_1K-20       151ns ± 0%     149ns ± 0%   -1.11%  (p=0.000 n=9+8)
RegexpMatchEasy1_32-20      52.3ns ± 5%    48.9ns ± 1%   -6.42%  (p=0.000 n=9+9)
RegexpMatchEasy1_1K-20       247ns ± 1%     255ns ± 1%   +3.04%  (p=0.000 n=8+9)
RegexpMatchMedium_32-20     5.04ns ± 0%    4.94ns ± 0%   -2.00%  (p=0.000 n=6+10)
RegexpMatchMedium_1K-20     25.7µs ± 0%    28.8µs ± 2%  +11.97%  (p=0.000 n=8+10)
RegexpMatchHard_32-20       1.20µs ± 0%    1.35µs ± 0%  +12.62%  (p=0.000 n=10+9)
RegexpMatchHard_1K-20       36.3µs ± 0%    41.0µs ± 1%  +12.82%  (p=0.000 n=10+10)
Revcomp-20                   342ms ± 1%     408ms ± 0%  +19.20%  (p=0.000 n=9+9)
Template-20                 43.2ms ± 1%    43.5ms ± 2%     ~     (p=0.143 n=10+10)
TimeParse-20                 258ns ± 0%     265ns ± 0%   +3.03%  (p=0.000 n=8+10)
TimeFormat-20                247ns ± 0%     264ns ± 0%   +6.88%  (p=0.001 n=7+6)
[Geo mean]                  34.1µs         35.4µs        +3.84%

name                      old speed      new speed      delta
GobDecode-20               170MB/s ± 1%   171MB/s ± 0%     ~     (p=0.063 n=10+10)
GobEncode-20               210MB/s ± 1%   210MB/s ± 1%     ~     (p=0.113 n=9+9)
Gzip-20                    107MB/s ± 1%   104MB/s ± 1%   -2.88%  (p=0.000 n=9+10)
Gunzip-20                  682MB/s ± 0%   647MB/s ± 0%   -5.14%  (p=0.000 n=10+10)
JSONEncode-20              262MB/s ± 0%   247MB/s ± 2%   -5.52%  (p=0.000 n=8+10)
JSONDecode-20             59.6MB/s ± 1%  56.8MB/s ± 1%   -4.73%  (p=0.000 n=9+10)
GoParse-20                21.5MB/s ± 0%  21.4MB/s ± 1%   -0.37%  (p=0.021 n=8+9)
RegexpMatchEasy0_32-20     588MB/s ± 3%   581MB/s ± 0%   -1.11%  (p=0.012 n=8+10)
RegexpMatchEasy0_1K-20    6.79GB/s ± 0%  6.88GB/s ± 0%   +1.28%  (p=0.000 n=9+10)
RegexpMatchEasy1_32-20     613MB/s ± 5%   655MB/s ± 1%   +6.80%  (p=0.000 n=9+9)
RegexpMatchEasy1_1K-20    4.14GB/s ± 1%  4.02GB/s ± 1%   -2.93%  (p=0.000 n=8+9)
RegexpMatchMedium_32-20    198MB/s ± 0%   203MB/s ± 0%   +2.09%  (p=0.000 n=7+10)
RegexpMatchMedium_1K-20   39.8MB/s ± 0%  35.6MB/s ± 2%  -10.67%  (p=0.000 n=8+10)
RegexpMatchHard_32-20     26.7MB/s ± 0%  23.7MB/s ± 0%  -11.22%  (p=0.000 n=10+9)
RegexpMatchHard_1K-20     28.2MB/s ± 0%  25.0MB/s ± 1%  -11.36%  (p=0.000 n=10+10)
Revcomp-20                 743MB/s ± 1%   623MB/s ± 0%  -16.11%  (p=0.000 n=9+9)
Template-20               44.9MB/s ± 1%  44.6MB/s ± 2%     ~     (p=0.128 n=10+10)
[Geo mean]                 203MB/s        196MB/s        -3.84%

Effectiveness of the Go Software Mitigation

The tables below compare benchmarks compiled without the Go software mitigation (old) against benchmarks compiled with the software mitigation (new), both run on a machine with the new microcode update.

name                      old time/op    new time/op    delta
BinaryTree17-20              2.61s ± 2%     2.44s ± 3%   -6.66%  (p=0.000 n=9+10)
Fannkuch11-20                2.58s ± 0%     2.29s ± 0%  -11.17%  (p=0.000 n=10+8)
FmtFprintfEmpty-20          35.4ns ± 2%    34.7ns ± 1%   -2.02%  (p=0.000 n=10+8)
FmtFprintfString-20         60.1ns ± 2%    60.1ns ± 0%     ~     (p=0.276 n=10+8)
FmtFprintfInt-20            66.8ns ± 0%    64.3ns ± 2%   -3.64%  (p=0.000 n=9+8)
FmtFprintfIntInt-20          105ns ± 1%      98ns ± 0%   -6.93%  (p=0.000 n=10+8)
FmtFprintfPrefixedInt-20     113ns ± 1%     111ns ± 0%   -2.03%  (p=0.000 n=10+8)
FmtFprintfFloat-20           179ns ± 0%     182ns ± 0%   +1.45%  (p=0.000 n=9+10)
FmtManyArgs-20               456ns ± 0%     426ns ± 0%   -6.67%  (p=0.000 n=10+7)
GobDecode-20                4.50ms ± 0%    4.49ms ± 1%     ~     (p=0.237 n=10+8)
GobEncode-20                3.66ms ± 1%    3.67ms ± 1%     ~     (p=0.400 n=9+10)
Gzip-20                      187ms ± 1%     184ms ± 1%   -1.97%  (p=0.000 n=10+10)
Gunzip-20                   30.0ms ± 0%    28.8ms ± 0%   -4.00%  (p=0.000 n=10+10)
HTTPClientServer-20          122µs ± 3%     122µs ± 5%     ~     (p=0.853 n=10+10)
JSONEncode-20               7.85ms ± 2%    7.37ms ± 1%   -6.09%  (p=0.000 n=10+9)
JSONDecode-20               34.2ms ± 1%    32.9ms ± 1%   -3.63%  (p=0.000 n=10+9)
Mandelbrot200-20            3.91ms ± 0%    3.91ms ± 0%   -0.17%  (p=0.000 n=10+8)
GoParse-20                  2.71ms ± 1%    2.71ms ± 1%     ~     (p=0.315 n=9+10)
RegexpMatchEasy0_32-20      55.0ns ± 0%    57.0ns ± 1%   +3.60%  (p=0.000 n=10+10)
RegexpMatchEasy0_1K-20       149ns ± 0%     147ns ± 0%   -1.34%  (p=0.001 n=8+9)
RegexpMatchEasy1_32-20      48.9ns ± 1%    50.2ns ± 4%   +2.56%  (p=0.000 n=9+10)
RegexpMatchEasy1_1K-20       255ns ± 1%     250ns ± 1%   -1.91%  (p=0.000 n=9+10)
RegexpMatchMedium_32-20     4.94ns ± 0%    4.94ns ± 0%     ~     (p=0.552 n=10+6)
RegexpMatchMedium_1K-20     28.8µs ± 2%    26.2µs ± 5%   -8.84%  (p=0.000 n=10+9)
RegexpMatchHard_32-20       1.35µs ± 0%    1.22µs ± 0%   -9.25%  (p=0.000 n=9+7)
RegexpMatchHard_1K-20       41.0µs ± 1%    37.0µs ± 1%   -9.67%  (p=0.000 n=10+8)
Revcomp-20                   408ms ± 0%     323ms ± 1%  -20.72%  (p=0.000 n=9+9)
Template-20                 43.5ms ± 2%    42.8ms ± 1%   -1.54%  (p=0.001 n=10+10)
TimeParse-20                 265ns ± 0%     264ns ± 0%   -0.49%  (p=0.000 n=10+6)
TimeFormat-20                264ns ± 0%     261ns ± 0%   -1.14%  (p=0.008 n=6+7)
[Geo mean]                  35.4µs         34.2µs        -3.54%

name                      old speed      new speed      delta
GobDecode-20               171MB/s ± 0%   171MB/s ± 1%     ~     (p=0.246 n=10+8)
GobEncode-20               210MB/s ± 1%   209MB/s ± 1%     ~     (p=0.400 n=9+10)
Gzip-20                    104MB/s ± 1%   106MB/s ± 1%   +2.01%  (p=0.000 n=10+10)
Gunzip-20                  647MB/s ± 0%   674MB/s ± 0%   +4.16%  (p=0.000 n=10+10)
JSONEncode-20              247MB/s ± 2%   263MB/s ± 1%   +6.48%  (p=0.000 n=10+9)
JSONDecode-20             56.8MB/s ± 1%  58.9MB/s ± 1%   +3.76%  (p=0.000 n=10+9)
GoParse-20                21.4MB/s ± 1%  21.3MB/s ± 1%     ~     (p=0.305 n=9+10)
RegexpMatchEasy0_32-20     581MB/s ± 0%   561MB/s ± 1%   -3.42%  (p=0.000 n=10+10)
RegexpMatchEasy0_1K-20    6.88GB/s ± 0%  6.95GB/s ± 0%   +1.06%  (p=0.000 n=10+9)
RegexpMatchEasy1_32-20     655MB/s ± 1%   638MB/s ± 4%   -2.48%  (p=0.000 n=9+10)
RegexpMatchEasy1_1K-20    4.02GB/s ± 1%  4.10GB/s ± 1%   +2.03%  (p=0.000 n=9+10)
RegexpMatchMedium_32-20    203MB/s ± 0%   202MB/s ± 0%     ~     (p=0.592 n=10+10)
RegexpMatchMedium_1K-20   35.6MB/s ± 2%  39.0MB/s ± 4%   +9.73%  (p=0.000 n=10+9)
RegexpMatchHard_32-20     23.7MB/s ± 0%  26.1MB/s ± 0%  +10.24%  (p=0.000 n=9+8)
RegexpMatchHard_1K-20     25.0MB/s ± 1%  27.7MB/s ± 1%  +10.70%  (p=0.000 n=10+8)
Revcomp-20                 623MB/s ± 0%   786MB/s ± 1%  +26.14%  (p=0.000 n=9+9)
Template-20               44.6MB/s ± 2%  45.3MB/s ± 1%   +1.56%  (p=0.001 n=10+10)
[Geo mean]                 196MB/s        203MB/s        +4.01%

Here we compare the results of the benchmark suite built with an unpatched Go compiler running on old microcode versus benchmarks compiled with the software mitigation and run on the new microcode.

name                      old time/op    new time/op    delta
BinaryTree17-20              2.40s ± 2%     2.44s ± 3%  +1.39%  (p=0.007 n=10+10)
Fannkuch11-20                2.31s ± 0%     2.29s ± 0%  -0.86%  (p=0.000 n=10+8)
FmtFprintfEmpty-20          34.8ns ± 0%    34.7ns ± 1%  -0.47%  (p=0.005 n=10+8)
FmtFprintfString-20         58.8ns ± 0%    60.1ns ± 0%  +2.25%  (p=0.000 n=8+8)
FmtFprintfInt-20            65.6ns ± 1%    64.3ns ± 2%  -1.84%  (p=0.003 n=9+8)
FmtFprintfIntInt-20          102ns ± 0%      98ns ± 0%  -3.45%  (p=0.000 n=10+8)
FmtFprintfPrefixedInt-20     114ns ± 1%     111ns ± 0%  -2.20%  (p=0.000 n=10+8)
FmtFprintfFloat-20           180ns ± 0%     182ns ± 0%  +0.89%  (p=0.000 n=9+10)
FmtManyArgs-20               431ns ± 0%     426ns ± 0%  -1.26%  (p=0.000 n=10+7)
GobDecode-20                4.52ms ± 1%    4.49ms ± 1%  -0.57%  (p=0.006 n=10+8)
GobEncode-20                3.65ms ± 2%    3.67ms ± 1%  +0.81%  (p=0.005 n=10+10)
Gzip-20                      182ms ± 1%     184ms ± 1%  +0.94%  (p=0.000 n=9+10)
Gunzip-20                   28.5ms ± 0%    28.8ms ± 0%  +1.20%  (p=0.000 n=10+10)
HTTPClientServer-20          119µs ± 2%     122µs ± 5%  +2.86%  (p=0.011 n=10+10)
JSONEncode-20               7.41ms ± 0%    7.37ms ± 1%  -0.60%  (p=0.008 n=8+9)
JSONDecode-20               32.6ms ± 1%    32.9ms ± 1%  +1.16%  (p=0.000 n=9+9)
Mandelbrot200-20            3.91ms ± 0%    3.91ms ± 0%    ~     (p=0.083 n=10+8)
GoParse-20                  2.70ms ± 0%    2.71ms ± 1%  +0.58%  (p=0.012 n=8+10)
RegexpMatchEasy0_32-20      54.4ns ± 3%    57.0ns ± 1%  +4.72%  (p=0.000 n=8+10)
RegexpMatchEasy0_1K-20       151ns ± 0%     147ns ± 0%  -2.43%  (p=0.000 n=9+9)
RegexpMatchEasy1_32-20      52.3ns ± 5%    50.2ns ± 4%  -4.03%  (p=0.001 n=9+10)
RegexpMatchEasy1_1K-20       247ns ± 1%     250ns ± 1%  +1.07%  (p=0.000 n=8+10)
RegexpMatchMedium_32-20     5.04ns ± 0%    4.94ns ± 0%  -1.98%  (p=0.002 n=6+6)
RegexpMatchMedium_1K-20     25.7µs ± 0%    26.2µs ± 5%  +2.07%  (p=0.000 n=8+9)
RegexpMatchHard_32-20       1.20µs ± 0%    1.22µs ± 0%  +2.21%  (p=0.000 n=10+7)
RegexpMatchHard_1K-20       36.3µs ± 0%    37.0µs ± 1%  +1.91%  (p=0.000 n=10+8)
Revcomp-20                   342ms ± 1%     323ms ± 1%  -5.50%  (p=0.000 n=9+9)
Template-20                 43.2ms ± 1%    42.8ms ± 1%  -0.99%  (p=0.002 n=10+10)
TimeParse-20                 258ns ± 0%     264ns ± 0%  +2.52%  (p=0.001 n=8+6)
TimeFormat-20                247ns ± 0%     261ns ± 0%  +5.67%  (p=0.001 n=7+7)
[Geo mean]                  34.1µs         34.2µs       +0.17%

name                      old speed      new speed      delta
GobDecode-20               170MB/s ± 1%   171MB/s ± 1%  +0.57%  (p=0.006 n=10+8)
GobEncode-20               210MB/s ± 1%   209MB/s ± 1%  -0.61%  (p=0.010 n=9+10)
Gzip-20                    107MB/s ± 1%   106MB/s ± 1%  -0.93%  (p=0.000 n=9+10)
Gunzip-20                  682MB/s ± 0%   674MB/s ± 0%  -1.19%  (p=0.000 n=10+10)
JSONEncode-20              262MB/s ± 0%   263MB/s ± 1%  +0.61%  (p=0.008 n=8+9)
JSONDecode-20             59.6MB/s ± 1%  58.9MB/s ± 1%  -1.15%  (p=0.000 n=9+9)
GoParse-20                21.5MB/s ± 0%  21.3MB/s ± 1%  -0.57%  (p=0.009 n=8+10)
RegexpMatchEasy0_32-20     588MB/s ± 3%   561MB/s ± 1%  -4.49%  (p=0.000 n=8+10)
RegexpMatchEasy0_1K-20    6.79GB/s ± 0%  6.95GB/s ± 0%  +2.36%  (p=0.000 n=9+9)
RegexpMatchEasy1_32-20     613MB/s ± 5%   638MB/s ± 4%  +4.14%  (p=0.001 n=9+10)
RegexpMatchEasy1_1K-20    4.14GB/s ± 1%  4.10GB/s ± 1%  -0.95%  (p=0.001 n=8+10)
RegexpMatchMedium_32-20    198MB/s ± 0%   202MB/s ± 0%  +2.05%  (p=0.000 n=7+10)
RegexpMatchMedium_1K-20   39.8MB/s ± 0%  39.0MB/s ± 4%  -1.97%  (p=0.000 n=8+9)
RegexpMatchHard_32-20     26.7MB/s ± 0%  26.1MB/s ± 0%  -2.13%  (p=0.000 n=10+8)
RegexpMatchHard_1K-20     28.2MB/s ± 0%  27.7MB/s ± 1%  -1.88%  (p=0.000 n=10+8)
Revcomp-20                 743MB/s ± 1%   786MB/s ± 1%  +5.81%  (p=0.000 n=9+9)
Template-20               44.9MB/s ± 1%  45.3MB/s ± 1%  +1.00%  (p=0.002 n=10+10)
[Geo mean]                 203MB/s        203MB/s       +0.01%

Comparing maximum of 5 prefixes to NOPs only

The next set of benchmarks were all run on the test machine with the microcode update. The results for the second column were generated by the Go software mitigation that permits a maximum of 5 prefixes per instruction. The results for the third column were generated by a version of this mitigation that uses only NOPs to pad jumps.

name                      old time/op    new time/op    delta
BinaryTree17-20              2.44s ± 3%     2.43s ± 2%    ~     (p=0.604 n=10+9)
Fannkuch11-20                2.29s ± 0%     2.29s ± 0%    ~     (p=0.321 n=8+9)
FmtFprintfEmpty-20          34.7ns ± 1%    35.3ns ± 5%  +1.90%  (p=0.001 n=8+9)
FmtFprintfString-20         60.1ns ± 0%    61.9ns ± 4%  +2.94%  (p=0.000 n=8+9)
FmtFprintfInt-20            64.3ns ± 2%    65.5ns ± 2%  +1.72%  (p=0.002 n=8+10)
FmtFprintfIntInt-20         98.0ns ± 0%    98.8ns ± 0%  +0.85%  (p=0.000 n=8+8)
FmtFprintfPrefixedInt-20     111ns ± 0%     111ns ± 0%    ~     (all equal)
FmtFprintfFloat-20           182ns ± 0%     183ns ± 0%  +0.99%  (p=0.000 n=10+10)
FmtManyArgs-20               426ns ± 0%     430ns ± 0%  +0.86%  (p=0.000 n=7+8)
GobDecode-20                4.49ms ± 1%    4.52ms ± 1%  +0.69%  (p=0.000 n=8+9)
GobEncode-20                3.67ms ± 1%    3.68ms ± 1%    ~     (p=0.912 n=10+10)
Gzip-20                      184ms ± 1%     189ms ± 0%  +2.71%  (p=0.000 n=10+9)
Gunzip-20                   28.8ms ± 0%    28.8ms ± 0%    ~     (p=0.481 n=10+10)
HTTPClientServer-20          122µs ± 5%     120µs ± 6%    ~     (p=0.105 n=10+10)
JSONEncode-20               7.37ms ± 1%    7.50ms ± 1%  +1.86%  (p=0.000 n=9+10)
JSONDecode-20               32.9ms ± 1%    33.1ms ± 0%    ~     (p=0.050 n=9+9)
Mandelbrot200-20            3.91ms ± 0%    3.92ms ± 0%  +0.41%  (p=0.000 n=8+10)
GoParse-20                  2.71ms ± 1%    2.71ms ± 1%    ~     (p=0.905 n=10+9)
RegexpMatchEasy0_32-20      57.0ns ± 1%    56.6ns ± 1%  -0.78%  (p=0.010 n=10+9)
RegexpMatchEasy0_1K-20       147ns ± 0%     148ns ± 0%  +0.68%  (p=0.000 n=9+9)
RegexpMatchEasy1_32-20      50.2ns ± 4%    50.0ns ± 3%    ~     (p=0.387 n=10+9)
RegexpMatchEasy1_1K-20       250ns ± 1%     251ns ± 3%    ~     (p=0.870 n=10+9)
RegexpMatchMedium_32-20     4.94ns ± 0%    4.94ns ± 0%    ~     (p=0.442 n=6+10)
RegexpMatchMedium_1K-20     26.2µs ± 5%    25.8µs ± 0%  -1.86%  (p=0.000 n=9+10)
RegexpMatchHard_32-20       1.22µs ± 0%    1.22µs ± 0%    ~     (p=0.081 n=7+9)
RegexpMatchHard_1K-20       37.0µs ± 1%    37.0µs ± 1%    ~     (p=0.721 n=8+8)
Revcomp-20                   323ms ± 1%     321ms ± 1%  -0.78%  (p=0.008 n=9+10)
Template-20                 42.8ms ± 1%    43.1ms ± 1%  +0.67%  (p=0.023 n=10+10)
TimeParse-20                 264ns ± 0%     266ns ± 0%  +0.76%  (p=0.001 n=6+8)
TimeFormat-20                261ns ± 0%     262ns ± 0%  +0.38%  (p=0.000 n=7+8)
[Geo mean]                  34.2µs         34.3µs       +0.40%

name                      old speed      new speed      delta
GobDecode-20               171MB/s ± 1%   170MB/s ± 1%  -0.69%  (p=0.000 n=8+9)
GobEncode-20               209MB/s ± 1%   209MB/s ± 1%    ~     (p=0.912 n=10+10)
Gzip-20                    106MB/s ± 1%   103MB/s ± 0%  -2.64%  (p=0.000 n=10+9)
Gunzip-20                  674MB/s ± 0%   673MB/s ± 0%    ~     (p=0.481 n=10+10)
JSONEncode-20              263MB/s ± 1%   259MB/s ± 1%  -1.82%  (p=0.000 n=9+10)
JSONDecode-20             58.9MB/s ± 1%  58.7MB/s ± 0%  -0.43%  (p=0.048 n=9+9)
GoParse-20                21.3MB/s ± 1%  21.3MB/s ± 2%    ~     (p=0.753 n=10+10)
RegexpMatchEasy0_32-20     561MB/s ± 1%   566MB/s ± 1%  +0.75%  (p=0.017 n=10+9)
RegexpMatchEasy0_1K-20    6.95GB/s ± 0%  6.93GB/s ± 0%  -0.37%  (p=0.000 n=9+9)
RegexpMatchEasy1_32-20     638MB/s ± 4%   640MB/s ± 3%    ~     (p=0.400 n=10+9)
RegexpMatchEasy1_1K-20    4.10GB/s ± 1%  4.08GB/s ± 3%    ~     (p=0.604 n=10+9)
RegexpMatchMedium_32-20    202MB/s ± 0%   203MB/s ± 0%    ~     (p=0.404 n=10+10)
RegexpMatchMedium_1K-20   39.0MB/s ± 4%  39.8MB/s ± 0%  +1.84%  (p=0.000 n=9+10)
RegexpMatchHard_32-20     26.1MB/s ± 0%  26.2MB/s ± 0%    ~     (p=0.132 n=8+9)
RegexpMatchHard_1K-20     27.7MB/s ± 1%  27.7MB/s ± 1%    ~     (p=0.701 n=8+8)
Revcomp-20                 786MB/s ± 1%   792MB/s ± 1%  +0.79%  (p=0.008 n=9+10)
Template-20               45.3MB/s ± 1%  45.0MB/s ± 1%  -0.66%  (p=0.022 n=10+10)
[Geo mean]                 203MB/s        203MB/s       -0.20%

For more complete information about performance and benchmark results, visit
www.intel.com/benchmarks. For specific information and notices/disclaimers
regarding the Jump Conditional Code Erratum, visit
https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf.

@markdryan markdryan commented Dec 17, 2019

I've updated the patch to use GOAMD64 to enable and disable the jump alignment code.

@markdryan markdryan commented Dec 18, 2019

Note that JMP 8(PC) does not mean 8 bytes in the text segment. It means 8 instructions forward in the assembly file. Those targets are resolved fairly early in the assembler. You'd probably have to go out of your way to break that when inserting padding into the encoding of individual instructions. (But please don't of course.)

In the end I couldn't resist the challenge and I did try to break JMP n(PC), but predictably, I failed. I contrived a small assembly language function with a JMP 2(PC) instruction that was directly followed by an unconditional jump. If the JMP 2(PC) instruction works correctly, this second jump instruction should be skipped. NOPs were used at the start of the function to ensure that the second jump instruction ended on a 32 byte boundary and would then require padding with the patch enabled.

Here's what the disassembly looks like with the patch disabled.

  4f81e0:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)  <- NOPs to force alignment of 2nd jump
  4f81e6:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  4f81ec:       66 90                   xchg   %ax,%ax
  4f81ee:       48 31 ff                xor    %rdi,%rdi
  4f81f1:       48 31 c0                xor    %rax,%rax
  4f81f4:       48 8b 54 24 08          mov    0x8(%rsp),%rdx
  4f81f9:       48 ff c0                inc    %rax
  4f81fc:       eb 02                   jmp    4f8200         <-  JMP 2(PC)
  4f81fe:       eb 08                   jmp    4f8208         <-  JMP we want to skip
  4f8200:       48 ff c7                inc    %rdi
  4f8203:       48 39 d7                cmp    %rdx,%rdi
  4f8206:       7c f1                   jl     4f81f9 
  4f8208:       48 89 44 24 10          mov    %rax,0x10(%rsp)
  4f820d:       c3                      retq   
  4f820e:       cc                      int3   
  4f820f:       cc                      int3

And here's what it looks like with the patch enabled.

  502fe0:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  502fe6:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  502fec:       66 90                   xchg   %ax,%ax
  502fee:       48 31 ff                xor    %rdi,%rdi
  502ff1:       48 31 c0                xor    %rax,%rax
  502ff4:       48 8b 54 24 08          mov    0x8(%rsp),%rdx
  502ff9:       48 ff c0                inc    %rax
  502ffc:       eb 04                   jmp    503002          <- JMP 2(PC) note second byte is now 4
  502ffe:       66 90                   xchg   %ax,%ax         <- 2 byte NOP inserted by the patch
  503000:       eb 08                   jmp    50300a          <- Jump we want to skip
  503002:       48 ff c7                inc    %rdi
  503005:       48 39 d7                cmp    %rdx,%rdi
  503008:       7c ef                   jl     502ff9 
  50300a:       48 89 44 24 10          mov    %rax,0x10(%rsp)
  50300f:       c3                      retq   

Note that there is now a 2 byte NOP between the two jump instructions and that the second jump starts at a new 32 byte boundary. Note also that the target of the first jump, the JMP 2(PC) instruction, has correctly been incremented by 2 bytes (it's encoded as eb 04 instead of eb 02), so the second jump is still skipped and everything works as expected.
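The padding decision in the listings above can be reproduced with a few lines of arithmetic. This is only a sketch of the rule as described in this issue (a jump must not cross or end on a 32-byte boundary), not code from the patch; the names `needsPadding` and `padBytes` are invented for the example, and it models NOP padding only.

```go
package main

import "fmt"

// boundary is the alignment that matters for the microcode update.
const boundary = 32

// needsPadding reports whether an instruction of the given size starting at
// addr crosses a 32-byte boundary or ends exactly on one.
func needsPadding(addr, size uint64) bool {
	return addr%boundary+size >= boundary
}

// padBytes returns how many bytes must be inserted before the instruction so
// that it starts on the next 32-byte boundary, or 0 if no padding is needed.
func padBytes(addr, size uint64) uint64 {
	if !needsPadding(addr, size) {
		return 0
	}
	return boundary - addr%boundary
}

func main() {
	// Addresses and sizes are taken from the patch-disabled listing above.
	fmt.Println(padBytes(0x4f81fc, 2)) // JMP 2(PC): fits inside its 32-byte block, prints 0
	fmt.Println(padBytes(0x4f81fe, 2)) // second jump: ends on 0x4f8200, prints 2
}
```

The second result matches the 2 byte NOP that the patch inserts at 502ffe in the patch-enabled listing.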

The only way I could get this to fail was to replace the JMP 2(PC) instruction with the byte sequence of a positive relative jump of two bytes directly in the assembly code, e.g.,

BYTE $0xeb
BYTE $0x02

Note this is the encoding of the first jump in the first example. When I did this I found that the second jump was skipped when the patch was disabled but executed when the patch was enabled, as the first jump instruction was not updated. With the patch enabled we then skip only the inserted 2 byte NOP instead of the NOP and the second jump. I can't imagine why anyone would want to write assembly code like this, given we have labels and JMP 2(PC), or even whether this sort of thing is officially supported. I mention it however, for completeness.
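To make the failure mode concrete, here is a small sketch that decodes the target of a 2-byte short jump (opcode followed by a signed 8-bit displacement relative to the next instruction). The function name `shortJumpTarget` is made up for this example; the addresses and displacement bytes come from the listings above.

```go
package main

import "fmt"

// shortJumpTarget returns the destination of a 2-byte short jump located at
// addr. The signed 8-bit displacement is relative to the address of the
// instruction that follows the jump.
func shortJumpTarget(addr uint64, disp byte) uint64 {
	next := addr + 2
	return uint64(int64(next) + int64(int8(disp)))
}

func main() {
	// Patch disabled: eb 02 skips the second jump.
	fmt.Printf("%#x\n", shortJumpTarget(0x4f81fc, 0x02)) // 0x4f8200
	// Patch enabled: the assembler rewrote the displacement to 04.
	fmt.Printf("%#x\n", shortJumpTarget(0x502ffc, 0x04)) // 0x503002
	// Hand-written BYTE $0xeb; BYTE $0x02 is not rewritten, so the jump
	// lands on the second jump instead of skipping it.
	fmt.Printf("%#x\n", shortJumpTarget(0x502ffc, 0x02)) // 0x503000
	// The backward jl in the patched listing (7c ef) still reaches the loop head.
	fmt.Printf("%#x\n", shortJumpTarget(0x503008, 0xef)) // 0x502ff9
}
```

The third call is the BYTE case: the displacement still says 2, but the bytes it was meant to skip have moved two bytes further away.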

@markdryan markdryan commented Jan 16, 2020

@rsc I've updated the patch to use GOAMD64 and uploaded some benchmark data that show the effects of the patch with and without the microcode update. I also double-checked that the patch doesn't break JMP n(PC) instructions, which it doesn't, although it can be fooled by jumps inserted into assembly code using BYTE statements. Is there now enough information available to make a decision on this bug?

@randall77 randall77 commented Jan 16, 2020

The additional benchmark I'd like to see is before and after the Go padding CL, on a chip that doesn't have the underlying bug. On an AMD chip, for example.

Some more info:

From here:

Below is the LLVM test suite we measured including the performance, code size and build time.
The data indicates some performance effect (1.7%) from the microcode update, which was reduced to 0.5% with the SW mitigation of prefix padding. The code size increase in test suite is ~0.5%. And the compile time increase is ~2%.
Comparing with hw_sw_prefix and hw_sw_nop, the exec_time difference is within -0.5%~0.5%, which may be within the margin of error.
Comparing with hw_sw_prefix and hw_sw_prefix_align_all, the exec_time difference is even less at 0.1%.
Given that LLVM test-suite is a relatively small benchmark, we do not conclude which padding is preferable, hw_sw_prefix, hw_sw_nop or hw_sw_prefix_align_all.

The padding buys us back 1.2% in performance for a cost of 0.5% code size.
There's only weak evidence that prefixes are better than nops.

From here:

Comparing with the hw_sw_prefix (prefix padding) with hw_sw_nop (nop padding) of SW mitigation, the hw_sw_prefix can provide better performance (0.3%~0.5% in geomean). In individual cases, we have observed a 1.4% performance improvement in prefix padding vs. nop padding. Comparing with sw_prefix with sw_nop on a system w/o MCU, we observed 0.7% better performance in sw_prefix.
We also measured the increase in code size due to the padding to instructions to align branches correctly (Table 4). The geomean code size increase is 2-3% in both prefix padding and nop padding, with the individual outliers up to 4%.

This set of benchmarks shows prefixes are better than nops by 0.3-0.5%. They are paying 2-3% in code size.

In all of the LLVM discussion, I didn't see any mention of performance cost on non-affected chips. Maybe that's something they don't have to deal with, but we do.

The Go benchmarks posted by @markdryan show a performance improvement of 3.5% for a space cost of 1.4-2%. That seems quite a bit better than the C benchmarks were demonstrating. Not sure why that would be. (Maybe Go code is more branchy?)

My opinion is we should just do nothing. This is tricky (see all the discussion in here about how this breaks debuggers in various ways), and the performance deltas just aren't that large.

@markdryan markdryan commented Jan 30, 2020

The additional benchmark I'd like to see is before and after the Go padding CL, on a chip that doesn't have the underlying bug. On an AMD chip, for example.

I have re-run the benchmarks on a Haswell Macbook (Core i7-4870HQ), the results of which I will post below. Some benchmarks are also available here although these were generated using an earlier version of the patch and it’s not clear what CPU was used. Finally, the second set of benchmarks posted above show the effect of the Go software mitigation on a machine which is affected by the Erratum but does not have the microcode update applied.

My opinion is we should just do nothing. This is tricky (see all the discussion in here about how this breaks debuggers in various ways), and the performance deltas just aren't that large.

It’s important to note that the software patch doesn’t just reduce the impact of the microcode update on performance. It also reduces the impact of the microcode update on consistency of performance from one build to another. Consider this scenario. Function “A” contains a macro-fused jump that implements a tight loop. This jump does not cross or end on a 32-byte boundary and incurs no performance penalty on a machine that is affected by the Erratum and that is running the microcode update. A new function, “B”, is then added to the program. B is not related to A in any way: it is not invoked directly or indirectly by A, nor does it invoke A. The linker, however, happens to place B before A in the final binary. This changes the alignment of A such that its macro-fused jump now crosses a 32-byte boundary, which in turn may degrade function A’s performance. Relatedly, without the software mitigation, performance effects such as this may accompany updates to a development environment (e.g., a compiler).
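The A/B scenario can be illustrated with a little address arithmetic. Every number in this sketch is invented for illustration (the start address of A, the offset and size of its fused compare-and-jump pair, and the size of B), and `crossesOrEnds` is a hypothetical helper, not something from the patch.

```go
package main

import "fmt"

// crossesOrEnds reports whether size bytes starting at addr cross a 32-byte
// boundary or end exactly on one.
func crossesOrEnds(addr, size uint64) bool {
	return addr%32+size >= 32
}

func main() {
	// All values below are made up for illustration.
	const (
		startOfA = 0x1000 // where the linker places A before B exists
		jumpInA  = 0x0c   // offset of A's fused cmp+jcc pair within A
		jumpSize = 5      // 3-byte cmp plus 2-byte jcc, treated as one unit
		sizeOfB  = 0x10   // size of the unrelated function B placed before A
	)

	// Before B is added: the pair sits at 0x100c and stays inside its block.
	fmt.Println(crossesOrEnds(startOfA+jumpInA, jumpSize)) // false

	// After B is added: A shifts by 16 bytes, the pair sits at 0x101c and
	// now straddles the boundary at 0x1020.
	fmt.Println(crossesOrEnds(startOfA+sizeOfB+jumpInA, jumpSize)) // true
}
```

Nothing in A changed between the two calls; only its placement did, which is why the effect can appear or disappear from one build to the next.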

This is not just a hypothetical concern. Experiments conducted with the Revcomp (Go) benchmark can demonstrate this variability, or more simply, how to prevent it. In the benchmark data provided above, Revcomp built at commit #99957b6 and tested on a system with the microcode update ran 19% slower. However, this loss can be eliminated with an unpatched compiler, simply by adding a few functions to the benchmark’s executable. Since function addition (or removal) unrelated to the Revcomp function (i.e. not invoked by Revcomp) can change its alignment (and thus benchmark performance), software mitigation is needed on machines affected by the Erratum to mitigate against this effect of the microcode update on code performance predictability.

Therefore, without a software mitigation it could be more difficult to rely on benchmark data for comparative purposes when this data is generated on machines affected by the Erratum that have the microcode update applied. One won’t know whether performance gains from a new optimization are due to skillful programming or because added code happens to have displaced a critical jump, given the potentially significant run-to-run variability.

What is the specific issue with debuggers? I did follow the link but didn’t see any direct discussion of debuggers, although I am not familiar with LLVM so I may have missed it. Is this an issue with prefix padding, NOP padding or both? I could provide a version of the patch that uses only NOPs to pad jumps. Note that the NOP only patch, like the prefix padding patch, relies on an existing mechanism in the assembler for inserting NOPs into the code stream, although it doesn't seem to be used currently (since the NACL code was removed).

@markdryan markdryan commented Jan 30, 2020

Impact of the Go software mitigation on a machine unaffected by the Erratum

Test System

  • MacBook Pro (Retina, 15-inch, Mid 2015)
  • Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  • Mojave 10.14.6
  • 16GiB RAM
  • Intel HyperThreading enabled
  • machdep.cpu.microcode_version: 27
  • Testing performed on 21st of Jan 2020
  • CGO_ENABLED=0 for go1.test

Impact of prefix padding on machines unaffected by the Erratum

The following tables show the effect of the Go software mitigation that uses prefixes to pad jumps on a machine that is not affected by the Erratum. The second column in the tables below represents the benchmark results achieved using a build of master at commit #99957b6. The third column shows the results obtained when the software mitigation is applied on top of commit #99957b6. The patch reduced the geomean run time for the time based tests by 0.24% (range -6.9% to +2.68%) and lowered the geomean throughput for the throughput tests by 0.21% (range -2.06% to 0%).

name                     old time/op    new time/op    delta
BinaryTree17-8              2.34s ± 1%     2.34s ± 1%    ~     (p=0.340 n=9+9)
Fannkuch11-8                2.19s ± 0%     2.09s ± 0%  -4.66%  (p=0.000 n=10+10)
FmtFprintfEmpty-8          36.5ns ± 2%    36.9ns ± 2%  +1.04%  (p=0.023 n=10+10)
FmtFprintfString-8         62.2ns ± 4%    61.5ns ± 3%    ~     (p=0.158 n=10+10)
FmtFprintfInt-8            62.8ns ± 2%    62.7ns ± 0%    ~     (p=0.617 n=10+8)
FmtFprintfIntInt-8         95.2ns ± 1%    97.7ns ± 1%  +2.68%  (p=0.000 n=10+10)
FmtFprintfPrefixedInt-8     120ns ± 1%     112ns ± 0%  -6.90%  (p=0.000 n=10+7)
FmtFprintfFloat-8           174ns ± 0%     173ns ± 0%  -0.75%  (p=0.000 n=7+10)
FmtManyArgs-8               409ns ± 0%     416ns ± 0%  +1.85%  (p=0.000 n=9+9)
GobDecode-8                4.56ms ± 1%    4.59ms ± 0%  +0.70%  (p=0.000 n=10+8)
GobEncode-8                3.66ms ± 1%    3.65ms ± 1%    ~     (p=0.497 n=9+10)
Gzip-8                      185ms ± 1%     185ms ± 0%    ~     (p=0.912 n=10+10)
Gunzip-8                   28.8ms ± 1%    28.8ms ± 0%    ~     (p=0.684 n=10+10)
HTTPClientServer-8         91.7µs ±10%    89.7µs ± 4%    ~     (p=0.497 n=10+9)
JSONEncode-8               7.42ms ± 1%    7.45ms ± 1%  +0.41%  (p=0.014 n=9+9)
JSONDecode-8               33.0ms ± 0%    32.9ms ± 1%    ~     (p=0.105 n=10+10)
Mandelbrot200-8            3.63ms ± 1%    3.64ms ± 0%    ~     (p=0.278 n=10+9)
GoParse-8                  2.65ms ± 1%    2.69ms ± 1%  +1.54%  (p=0.000 n=10+10)
RegexpMatchEasy0_32-8      57.1ns ± 1%    57.2ns ± 1%    ~     (p=0.286 n=9+9)
RegexpMatchEasy0_1K-8       151ns ± 3%     152ns ± 1%    ~     (p=0.129 n=10+10)
RegexpMatchEasy1_32-8      52.7ns ± 1%    52.8ns ± 0%    ~     (p=0.652 n=10+8)
RegexpMatchEasy1_1K-8       251ns ± 2%     251ns ± 2%    ~     (p=0.814 n=10+10)
RegexpMatchMedium_32-8     4.96ns ± 1%    4.97ns ± 0%    ~     (p=0.138 n=10+10)
RegexpMatchMedium_1K-8     24.6µs ± 0%    25.1µs ± 1%  +2.10%  (p=0.000 n=10+10)
RegexpMatchHard_32-8       1.25µs ± 6%    1.23µs ± 0%    ~     (p=0.479 n=10+9)
RegexpMatchHard_1K-8       37.2µs ± 4%    37.1µs ± 0%    ~     (p=0.150 n=10+9)
Revcomp-8                   327ms ± 1%     328ms ± 0%    ~     (p=0.222 n=9+9)
Template-8                 42.6ms ± 1%    42.5ms ± 1%    ~     (p=0.400 n=9+10)
TimeParse-8                 260ns ± 0%     255ns ± 0%  -2.04%  (p=0.000 n=10+8)
TimeFormat-8                255ns ± 0%     260ns ± 0%  +2.18%  (p=0.000 n=9+9)
[Geo mean]                 33.7µs         33.6µs       -0.24%

name                     old speed      new speed      delta
GobDecode-8               168MB/s ± 1%   167MB/s ± 0%  -0.69%  (p=0.000 n=10+8)
GobEncode-8               210MB/s ± 1%   210MB/s ± 1%    ~     (p=0.497 n=9+10)
Gzip-8                    105MB/s ± 1%   105MB/s ± 0%    ~     (p=0.895 n=10+10)
Gunzip-8                  674MB/s ± 1%   674MB/s ± 0%    ~     (p=0.643 n=10+10)
JSONEncode-8              262MB/s ± 1%   261MB/s ± 1%  -0.41%  (p=0.014 n=9+9)
JSONDecode-8             58.8MB/s ± 0%  58.9MB/s ± 1%    ~     (p=0.101 n=10+10)
GoParse-8                21.9MB/s ± 1%  21.5MB/s ± 1%  -1.52%  (p=0.000 n=10+10)
RegexpMatchEasy0_32-8     561MB/s ± 1%   559MB/s ± 1%    ~     (p=0.182 n=9+10)
RegexpMatchEasy0_1K-8    6.78GB/s ± 3%  6.74GB/s ± 1%    ~     (p=0.143 n=10+10)
RegexpMatchEasy1_32-8     607MB/s ± 1%   607MB/s ± 0%    ~     (p=0.985 n=10+9)
RegexpMatchEasy1_1K-8    4.08GB/s ± 2%  4.09GB/s ± 2%    ~     (p=0.631 n=10+10)
RegexpMatchMedium_32-8    201MB/s ± 0%   201MB/s ± 0%    ~     (p=0.197 n=10+10)
RegexpMatchMedium_1K-8   41.6MB/s ± 0%  40.8MB/s ± 1%  -2.06%  (p=0.000 n=10+10)
RegexpMatchHard_32-8     25.6MB/s ± 6%  26.0MB/s ± 0%    ~     (p=0.482 n=10+9)
RegexpMatchHard_1K-8     27.5MB/s ± 4%  27.6MB/s ± 0%    ~     (p=0.147 n=10+9)
Revcomp-8                 777MB/s ± 1%   776MB/s ± 0%    ~     (p=0.222 n=9+9)
Template-8               45.6MB/s ± 1%  45.7MB/s ± 1%    ~     (p=0.367 n=9+10)
[Geo mean]                203MB/s        202MB/s       -0.21%

Impact of NOP padding on machines unaffected by the Erratum

The following tables show the effect of the Go software mitigation that uses only NOPs to pad jumps on a machine that is not affected by the Erratum. The second column in the tables below represents the benchmark results achieved using a build of master at commit #99957b6. The third column shows the results obtained when the NOP only version of the software mitigation is applied on top of commit #99957b6. The patch increased the geomean run time for the time based tests by 0.29% (range -6.25% to +4.85%) and lowered the geomean throughput for the throughput tests by 0.80% (range -4.58% to 0.80%).

name                     old time/op    new time/op    delta
BinaryTree17-8              2.34s ± 1%     2.39s ± 1%  +2.05%  (p=0.000 n=9+9)
Fannkuch11-8                2.19s ± 0%     2.11s ± 1%  -3.89%  (p=0.000 n=10+10)
FmtFprintfEmpty-8          36.5ns ± 2%    36.5ns ± 1%    ~     (p=0.697 n=10+10)
FmtFprintfString-8         62.2ns ± 4%    61.1ns ± 4%    ~     (p=0.071 n=10+10)
FmtFprintfInt-8            62.8ns ± 2%    63.0ns ± 0%    ~     (p=0.384 n=10+9)
FmtFprintfIntInt-8         95.2ns ± 1%    96.6ns ± 1%  +1.50%  (p=0.000 n=10+9)
FmtFprintfPrefixedInt-8     120ns ± 1%     113ns ± 1%  -6.25%  (p=0.000 n=10+9)
FmtFprintfFloat-8           174ns ± 0%     175ns ± 0%  +0.57%  (p=0.001 n=7+7)
FmtManyArgs-8               409ns ± 0%     416ns ± 1%  +1.86%  (p=0.000 n=9+10)
GobDecode-8                4.56ms ± 1%    4.60ms ± 0%  +0.86%  (p=0.000 n=10+10)
GobEncode-8                3.66ms ± 1%    3.63ms ± 0%  -0.80%  (p=0.000 n=9+9)
Gzip-8                      185ms ± 1%     185ms ± 0%    ~     (p=0.243 n=10+9)
Gunzip-8                   28.8ms ± 1%    29.0ms ± 1%  +0.67%  (p=0.007 n=10+10)
HTTPClientServer-8         91.7µs ±10%    91.1µs ± 4%    ~     (p=0.780 n=10+9)
JSONEncode-8               7.42ms ± 1%    7.54ms ± 1%  +1.63%  (p=0.000 n=9+9)
JSONDecode-8               33.0ms ± 0%    32.8ms ± 0%  -0.66%  (p=0.000 n=10+8)
Mandelbrot200-8            3.63ms ± 1%    3.62ms ± 0%    ~     (p=0.447 n=10+9)
GoParse-8                  2.65ms ± 1%    2.70ms ± 1%  +2.04%  (p=0.000 n=10+10)
RegexpMatchEasy0_32-8      57.1ns ± 1%    57.7ns ± 1%  +1.20%  (p=0.000 n=9+10)
RegexpMatchEasy0_1K-8       151ns ± 3%     152ns ± 2%    ~     (p=0.088 n=10+9)
RegexpMatchEasy1_32-8      52.7ns ± 1%    55.3ns ± 1%  +4.85%  (p=0.000 n=10+10)
RegexpMatchEasy1_1K-8       251ns ± 2%     257ns ± 2%  +2.51%  (p=0.003 n=10+10)
RegexpMatchMedium_32-8     4.96ns ± 1%    4.95ns ± 1%  -0.28%  (p=0.039 n=10+10)
RegexpMatchMedium_1K-8     24.6µs ± 0%    25.1µs ± 0%  +1.96%  (p=0.000 n=10+9)
RegexpMatchHard_32-8       1.25µs ± 6%    1.24µs ± 0%    ~     (p=0.468 n=10+10)
RegexpMatchHard_1K-8       37.2µs ± 4%    37.2µs ± 0%    ~     (p=0.156 n=10+9)
Revcomp-8                   327ms ± 1%     327ms ± 1%    ~     (p=0.136 n=9+9)
Template-8                 42.6ms ± 1%    42.6ms ± 1%    ~     (p=0.604 n=9+10)
TimeParse-8                 260ns ± 0%     258ns ± 0%  -1.04%  (p=0.000 n=10+10)
TimeFormat-8                255ns ± 0%     262ns ± 0%  +2.83%  (p=0.000 n=9+9)
[Geo mean]                 33.7µs         33.8µs       +0.29%

name                     old speed      new speed      delta
GobDecode-8               168MB/s ± 1%   167MB/s ± 0%  -0.85%  (p=0.000 n=10+10)
GobEncode-8               210MB/s ± 1%   212MB/s ± 0%  +0.80%  (p=0.000 n=9+9)
Gzip-8                    105MB/s ± 1%   105MB/s ± 0%    ~     (p=0.234 n=10+9)
Gunzip-8                  674MB/s ± 1%   670MB/s ± 1%  -0.67%  (p=0.007 n=10+10)
JSONEncode-8              262MB/s ± 1%   257MB/s ± 1%  -1.61%  (p=0.000 n=9+9)
JSONDecode-8             58.8MB/s ± 0%  59.1MB/s ± 0%  +0.66%  (p=0.000 n=10+8)
GoParse-8                21.9MB/s ± 1%  21.4MB/s ± 1%  -2.00%  (p=0.000 n=10+10)
RegexpMatchEasy0_32-8     561MB/s ± 1%   554MB/s ± 1%  -1.22%  (p=0.000 n=9+10)
RegexpMatchEasy0_1K-8    6.78GB/s ± 3%  6.72GB/s ± 2%    ~     (p=0.113 n=10+9)
RegexpMatchEasy1_32-8     607MB/s ± 1%   579MB/s ± 1%  -4.58%  (p=0.000 n=10+10)
RegexpMatchEasy1_1K-8    4.08GB/s ± 2%  3.98GB/s ± 2%  -2.36%  (p=0.004 n=10+10)
RegexpMatchMedium_32-8    201MB/s ± 0%   202MB/s ± 0%  +0.36%  (p=0.008 n=10+9)
RegexpMatchMedium_1K-8   41.6MB/s ± 0%  40.8MB/s ± 0%  -1.93%  (p=0.000 n=10+9)
RegexpMatchHard_32-8     25.6MB/s ± 6%  25.9MB/s ± 0%    ~     (p=0.470 n=10+10)
RegexpMatchHard_1K-8     27.5MB/s ± 4%  27.5MB/s ± 0%    ~     (p=0.150 n=10+9)
Revcomp-8                 777MB/s ± 1%   778MB/s ± 1%    ~     (p=0.136 n=9+9)
Template-8               45.6MB/s ± 1%  45.6MB/s ± 1%    ~     (p=0.618 n=9+10)
[Geo mean]                203MB/s        201MB/s       -0.80%

For more complete information about performance and benchmark results, visit
www.intel.com/benchmarks. For specific information and notices/disclaimers
regarding the Jump Conditional Code Erratum, visit
https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf.

Member

@martisch martisch commented Jan 30, 2020

> It’s important to note that the software patch doesn’t just reduce the impact of the microcode update on performance. It also reduces the impact of the microcode update on consistency of performance, from one build to another. Without a software mitigation, it could be difficult to rely on benchmark data for comparative purposes.

As far as I understand, cache alignment and loop stream decoder block alignment effects do not only happen for non-padded jumps (we have seen this happen on earlier versions of Go and on chips that are not affected). Aligning functions on 32-byte boundaries instead of 16-byte boundaries would by itself solve many of these without requiring padded jumps. Just aligning functions on 32 bytes generally sounds like a good idea and is likely to have positive effects even on chips not affected by the erratum.

Which leaves the remaining question if padded jumps are needed in addition.
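To make the 16-byte versus 32-byte function alignment idea concrete, here is a minimal sketch of rounding a function's start offset up to an alignment. The `alignUp` helper and the sample offsets are hypothetical and are not taken from the Go linker.

```go
package main

import "fmt"

// alignUp rounds addr up to the next multiple of align.
// align must be a power of two. Hypothetical helper for illustration.
func alignUp(addr, align uint64) uint64 {
	return (addr + align - 1) &^ (align - 1)
}

func main() {
	// Hypothetical offsets at which the previous function ended.
	for _, start := range []uint64{0x1000, 0x1010, 0x1031} {
		fmt.Printf("start %#x: 16-byte aligned %#x, 32-byte aligned %#x\n",
			start, alignUp(start, 16), alignUp(start, 32))
	}
}
```

With 32-byte alignment, a function that would have started at 0x1010 moves to 0x1020, so code inside it keeps the same position relative to 32-byte windows no matter how earlier functions change in size.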

@tandr tandr commented Jan 30, 2020

Curious: how would padding to a 32-byte boundary affect both the on-disk binary size and the in-memory code size?

Contributor

@ianlancetaylor ianlancetaylor commented Jan 30, 2020

> It’s important to note that the software patch doesn’t just reduce the impact of the microcode update on performance. It also reduces the impact of the microcode update on consistency of performance, from one build to another. Without a software mitigation, it could be difficult to rely on benchmark data for comparative purposes.

This is a general problem with Go on x86 processors. We frequently see changes in performance from one build to another. It would be nice to address that in general, but I'm a bit skeptical that this particular approach will fix all such problems.

Contributor

@markdryan markdryan commented Jan 31, 2020

> This is a general problem with Go on x86 processors. We frequently see changes in performance from one build to another. It would be nice to address that in general, but I'm a bit skeptical that this particular approach will fix all such problems.

The problem of varying performance, from build to build, is likely to get worse on machines affected by the Erratum, once the microcode update is applied. The patch discussed in this issue doesn’t attempt to solve the general problem of varying performance. It only attempts to mitigate against the effect of the microcode update on build to build benchmark consistency on machines affected by the Erratum. I have changed the wording of my previous post to better reflect this.

Contributor

@markdryan markdryan commented Feb 5, 2020

> Which leaves the remaining question if padded jumps are needed in addition.

Aligning functions on 32-byte boundaries would prevent the build-to-build consistency problems caused by the microcode update on machines affected by the Erratum in the one scenario I described above. However, this scenario is just one example of how the microcode update can affect both performance and consistency of performance from one build to another. There are others. A seemingly innocuous change to a function, or a minor update to a user’s development environment that pulls in a small change (to an inlinable function, for example), could change the alignment of jumps in the user’s code. On machines affected by the Erratum and running the microcode update, there’s an increased risk that such changes will have a performance effect. Padding is required on those machines to mitigate this increased risk.
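As a rough illustration of how a small upstream change can move a jump into the affected position, here is a minimal sketch. It assumes the condition stated at the top of this issue (a jump that ends at or crosses a 32-byte boundary). The `touchesBoundary` helper and the offsets are hypothetical.

```go
package main

import "fmt"

// boundary is the 32-byte window size described for the microcode update.
const boundary = 32

// touchesBoundary reports whether an instruction occupying the bytes
// [start, start+size) crosses a 32-byte boundary or ends exactly at one.
// Hypothetical helper for illustration only.
func touchesBoundary(start, size uint64) bool {
	return start/boundary != (start+size)/boundary
}

func main() {
	const jumpSize = 2 // e.g. a short conditional jump

	// The same jump before and after 3 bytes are added earlier in the function.
	for _, start := range []uint64{27, 30} {
		fmt.Printf("jump at offset %d: affected=%v\n",
			start, touchesBoundary(start, jumpSize))
	}
}
```

In this sketch the jump at offset 27 stays inside one window, while the same jump shifted to offset 30 ends exactly at the boundary and would no longer be cached in the decoded icache.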

Member

@martisch martisch commented Feb 5, 2020

I agree that the effects you describe can happen and cause performance regressions. On the other side, I think there are also effects that happen just because of the padding. A jump that was previously aligned may now be misaligned. Code that previously fit into x cache lines now takes x+1 cache lines. So there are performance regressions that happen purely due to the padding. The tradeoff is therefore between adding the padding (with the added options and code) and accepting regressions in some places, versus having regressions in other places.
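The "x versus x+1 cache lines" effect can be sketched as follows, assuming 64-byte cache lines. The `cacheLines` helper, the start offset and the code sizes are hypothetical numbers chosen for illustration.

```go
package main

import "fmt"

const cacheLineSize = 64 // typical x86 cache line size in bytes

// cacheLines returns how many cache lines the byte range
// [start, start+size) touches. size must be greater than zero.
func cacheLines(start, size uint64) uint64 {
	first := start / cacheLineSize
	last := (start + size - 1) / cacheLineSize
	return last - first + 1
}

func main() {
	const start = 0x40 // hypothetical, cache-line-aligned function start
	fmt.Println("unpadded (126 bytes):", cacheLines(start, 126))
	fmt.Println("padded   (130 bytes):", cacheLines(start, 130))
}
```

Here four bytes of padding are enough to push the code from two cache lines to three.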

If we are purely going to do it to stabilize performance, then it should not be an option but the default, as the default is what I expect the majority of Go users to run. Otherwise, users who turn the option on and off to see its performance effect will also measure the effects of code other than the padded jumps now being aligned differently. However, turning on the option for all CPUs also applies padding to CPUs that don't need it and changes their performance (for better or worse).

@gopherbot gopherbot commented Feb 13, 2020

Change https://golang.org/cl/219357 mentions this issue: cmd/internal/obj/x86: prevent jumps crossing 32 byte boundaries

Contributor

@markdryan markdryan commented Feb 13, 2020

I've uploaded a version of the patch that only uses NOPs to pad jumps. Prefixes are not used.
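For readers who want a feel for what NOP-only padding has to compute, here is a minimal sketch. It is not the logic in the uploaded patch. The `nopPadding` helper is hypothetical, and it assumes the jump is shorter than the 32-byte window and that the padding is placed directly before the jump.

```go
package main

import "fmt"

// jumpAlignment is the boundary that jumps must not cross or end at.
const jumpAlignment = 32

// nopPadding returns how many bytes of NOPs to place before a jump of the
// given size at offset start so that it neither crosses nor ends at a
// jumpAlignment boundary. Hypothetical helper; assumes 0 < size < jumpAlignment.
func nopPadding(start, size uint64) uint64 {
	if start/jumpAlignment == (start+size)/jumpAlignment {
		return 0 // already entirely inside one window
	}
	// Push the jump to the start of the next window.
	return jumpAlignment - start%jumpAlignment
}

func main() {
	for _, j := range []struct{ start, size uint64 }{{10, 2}, {30, 2}, {29, 6}} {
		pad := nopPadding(j.start, j.size)
		fmt.Printf("jump at %d (size %d): pad %d bytes, new offset %d\n",
			j.start, j.size, pad, j.start+pad)
	}
}
```

Every padded jump shifts all following code in the function, which is why the padding itself can change the alignment of later instructions.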

Contributor

@dr2chase dr2chase commented Feb 13, 2020

Just so everyone knows, the penalty can be quite bad, as reported in #37190.

I reproduced this,

  • comparing 1.13 (which was just lucky),
  • two versions of aligned,
  • and unpadded
name \ time/op    Go-1.13     Go-1.14-vzu-align  Go-1.14-vzu-nopalign  Go-1.14-vzu
FastTest2KB-4     141ns ± 2%  115ns ± 2%         112ns ± 0%            269ns ± 1%

Note that alignment improves the best case, so the best-to-worst slowdown exceeds 100% when things line up just so.
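The arithmetic behind that statement can be checked with a small sketch using the FastTest2KB-4 numbers from the table above. The `slowdownPct` helper is hypothetical.

```go
package main

import "fmt"

// slowdownPct returns how much slower t is than base, in percent.
func slowdownPct(base, t float64) float64 {
	return (t - base) / base * 100
}

func main() {
	const unpadded = 269.0 // ns/op, Go-1.14-vzu

	fmt.Printf("vs Go-1.13 (141ns):              %.0f%%\n", slowdownPct(141, unpadded))
	fmt.Printf("vs Go-1.14-vzu-align (115ns):    %.0f%%\n", slowdownPct(115, unpadded))
	fmt.Printf("vs Go-1.14-vzu-nopalign (112ns): %.0f%%\n", slowdownPct(112, unpadded))
}
```

Relative to Go 1.13 the unpadded build is roughly 91% slower. Relative to the NOP-aligned build it is roughly 140% slower.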

For that set of benchmarks (excluding those affected by a not-fully-mitigated MacOS bug):

name \ time/op    Go-1.13     Go-1.14-vzu-align  Go-1.14-vzu-nopalign  Go-1.14-vzu
[Geo mean]        54.9µs      53.1µs             53.3µs                55.1µs

The benchmarks were those in https://github.com/dr2chase/bent

In another benchmark run, I also checked the size and performance costs of 16-byte vs 32-byte alignment. We want 32-byte alignment: 16 gives 0.7% bigger text and 0.82% slower geomean execution, with almost no winners in the run-time column.

For reference, the two benchmark configurations:

[[Configurations]]
  Name = "Go-1.14-vzeroupper-nopalign-32-lessf2i-nopreempt"
  Root = "$HOME/work/go-quick/"
  GcEnv = ["GOAMD64=alignedjumps"]
  RunEnv = ["GODEBUG=asyncpreemptoff=1"]

[[Configurations]]
  Name = "Go-1.14-vzeroupper-nopalign-16-lessf2i-nopreempt"
  Root = "$HOME/work/go/"
  GcEnv = ["GOAMD64=alignedjumps"]
  RunEnv = ["GODEBUG=asyncpreemptoff=1"]

and git diff in go:

diff --git a/src/cmd/internal/obj/x86/asm6.go b/src/cmd/internal/obj/x86/asm6.go
index 16e73fad44..21d254d1e2 100644
--- a/src/cmd/internal/obj/x86/asm6.go
+++ b/src/cmd/internal/obj/x86/asm6.go
@@ -1982,7 +1982,7 @@ func makePjc(ctxt *obj.Link) *padJumpsCtx {
                return &padJumpsCtx{}
        }
        return &padJumpsCtx{
-               jumpAlignment: 32,
+               jumpAlignment: 16,
        }
 }
 
diff --git a/src/cmd/link/internal/amd64/obj.go b/src/cmd/link/internal/amd64/obj.go
index 3239c61864..f1f2e3e11c 100644
--- a/src/cmd/link/internal/amd64/obj.go
+++ b/src/cmd/link/internal/amd64/obj.go
@@ -40,9 +40,9 @@ func Init() (*sys.Arch, ld.Arch) {
        arch := sys.ArchAMD64
 
        fa := funcAlign
-       if objabi.GOAMD64 == "alignedjumps" {
-               fa = 32
-       }
+       //if objabi.GOAMD64 == "alignedjumps" {
+       //      fa = 32
+       //}
 
        theArch := ld.Arch{
                Funcalign:  fa,

I think we should be looking at the NOP-only patch and probably just have it turned on all the time. This seems like the least-risk way of avoiding this sometimes-terrible slowdown that will also interfere with performance-tuning work on updated-microcode Intel processors.

@toothrot toothrot modified the milestones: Go1.14, Go1.15 Feb 25, 2020