cmd/internal/obj/x86: pad jumps to avoid Intel erratum #35881
Intel erratum SKX102 “Processor May Behave Unpredictably Under Complex Sequence of Conditions Which Involve Branches That Cross 64-Byte Boundaries” applies to:
There is a microcode fix that can be applied by the BIOS to avoid the incorrect execution. It stops any jump (jump, jcc, call, ret, direct, indirect) from being cached in the decoded icache when the instruction ends at or crosses a 32-byte boundary. Intel says:
The suggested workaround for the slowdown introduced by the microcode fix is to insert padding so that fused branch sequences never end at or cross a 64-byte boundary. This means padding the whole CMP+Jcc pair, not just the Jcc.
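As a rough sketch of the check involved (hypothetical helper names, not the actual assembler code; this uses the 32-byte granularity the assembler patches work to, as noted later in the thread, and treats the fused CMP+Jcc as a single span):

```go
package main

import "fmt"

// crossesOrEnds reports whether an instruction span [off, off+size)
// ends at or crosses a 32-byte boundary. For a macro-fused CMP+Jcc,
// size must cover the whole fused pair, not just the Jcc.
func crossesOrEnds(off, size int64) bool {
	return (off+size)/32 != off/32 || (off+size)%32 == 0
}

// padding returns how many bytes to insert before the span so that it
// no longer ends at or crosses a 32-byte boundary. Assumes size < 32,
// which holds for real jump (and fused compare+jump) encodings.
func padding(off, size int64) int64 {
	var n int64
	for crossesOrEnds(off+n, size) {
		n++
	}
	return n
}

func main() {
	fmt.Println(crossesOrEnds(30, 4)) // true: spans the boundary at 32
	fmt.Println(crossesOrEnds(28, 4)) // true: ends exactly at 32
	fmt.Println(crossesOrEnds(8, 4))  // false: safely inside one block
	fmt.Println(padding(30, 4))       // 2: shift the jump to start at 32
}
```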
CL 206837 adds a new environment variable to set the padding policy. The original CL used $GO_X86_PADJUMP but the discussion has moved on to using $GOAMD64, which would avoid breaking the build cache.
There are really two questions here:
In general, we try to do the right thing for developers so that they don't have to keep track of every last CPU erratum. That seems like it would suggest we should do the padding automatically. Otherwise Go programs on this very large list of processors have the possibility of behaving “unpredictably.”
If the overheads involved are small enough and we are 100% confident in the padding code, we could stop there and just leave it on unconditionally. It seems like that's what we should do rather than open the door to arbitrary compiler option configuration in $GOAMD64, and all the complexity that comes with it.
So what are the overheads? Here is an estimate.
This go command has 10,362 functions, and the padding required for instructions crossing or ending at a 16-byte boundary should average out to 251,848 extra bytes. The awk adds 3 to conditional jumps to simulate fusing of a preceding register-register CMP instruction.
Changing function alignment to 32 bytes would halve the padding added (saving 125,924 bytes) but add 16 more bytes on average to each of the functions (adding 165,792 bytes). So changing function alignment does not seem to be worthwhile.
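The arithmetic behind that tradeoff can be spelled out directly (numbers from the estimate above):

```go
package main

import "fmt"

func main() {
	const numFuncs = 10362 // functions in this go command build
	const pad16 = 251848   // estimated padding bytes at 16-byte alignment

	saved := pad16 / 2     // 32-byte alignment roughly halves the padding
	added := numFuncs * 16 // but costs ~16 extra alignment bytes per function

	fmt.Println(saved)         // 125924
	fmt.Println(added)         // 165792
	fmt.Println(added > saved) // true: a net size increase, so not worthwhile
}
```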
Same for a smaller binary:
Changing alignment to 32 would save 29,318.4 bytes but add 47,296 bytes.
Overall, the added bytes are 1.67% in both the go command and gofmt. This is not nothing, but it seems like a small price to pay for correct execution, and if it makes things faster on some systems, even better.
My tentative conclusion then would be that we should just turn this on by default and not have an option. Thoughts?
@rsc thank you for entering the bug and summarising the issue.
The assembler patches work by ensuring that neither standalone nor macro-fused jumps end on or cross 32 byte boundaries, not 64.
This is a difficult question to answer as there isn't a single value that is optimal across all architectures. The white paper notes that the current default of 5 may not be optimal for some Atom processors that are not affected by the erratum. These processors may take longer to decode instructions that have more than 3 or 4 prefixes.
On the other hand, a default value of 3 is sub-optimal for processors affected by the Erratum as it means that there is less prefix space available for padding. If the instructions that precede a jump do not have sufficient prefix space available to pad that jump, the patch pads the jump with NOPs instead, which are less efficient.
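A minimal sketch of that fallback logic (hypothetical names; the real patch distributes prefixes across the instructions that precede the jump): given the padding a jump needs and the prefix space available, use prefixes first and make up the rest with NOPs.

```go
package main

import "fmt"

// padJump splits the padding a jump needs between redundant prefixes
// added to preceding instructions and NOP bytes inserted before the
// jump. avail is the total prefix space available; a lower
// per-instruction prefix cap (e.g. 3 instead of 5) shrinks avail and
// forces more of the padding into NOPs.
func padJump(need, avail int) (prefixes, nops int) {
	prefixes = need
	if prefixes > avail {
		prefixes = avail
	}
	return prefixes, need - prefixes
}

func main() {
	fmt.Println(padJump(4, 6)) // 4 0: prefixes alone suffice
	fmt.Println(padJump(4, 2)) // 2 2: two bytes fall back to NOPs
	fmt.Println(padJump(3, 0)) // 0 3: no prefix space, NOPs only
}
```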
I modified the patch locally so that it retains the existing 16 byte function alignment but pads jumps so that they do not end on or cross 16 byte boundaries. I found that this actually increases binary size slightly over the original patch which uses 32 byte alignment. So for the Go binary I see binary sizes of 15551238 (16 byte alignment) vs 15436550 (32 byte alignment). For go1.test I see 11800358 (16 byte alignment) vs 11722534 (32 byte alignment).
What is more worrying however, is that I see many more NOPs in the code stream when using 16 byte alignment. So for the go1.test, padding with 16 bytes yields 25001 NOPs (of varying size) whereas padding with 32 bytes yields only 12375. This stands to reason really. The patch cannot always use prefixes to pad jumps and when prefixes can't be used, it falls back to NOPs. If we reduce function alignment to 16 we'll increase the number of jumps that need to be padded and, most likely, the number of NOPs that need to be inserted. This is likely to hurt performance.
I think the concerns about enabling the patch by default are:
On the other hand, another issue with not having the patch enabled by default is that, to really take advantage of it, you would need to build the Go toolchain yourself; otherwise the standard library functions that your program links against would presumably not be compiled with the mitigation.
@rsc Correct me if I'm wrong but my understanding is that there are two ways to fix the unpredictable behavior caused by this CPU erratum:
I.e. the padding is actually not required IFF Go programs can be assumed to run under the latest microcode versions. In this case the padding is "just" a performance optimization for the affected CPUs with the latest microcode fixes.
@knweiss your text also aligns with my understanding: if the affected CPU has updated microcode, then padding is only a performance improvement for that affected, microcode-updated CPU, while adding the padding lowers performance (decode bandwidth, icache usage) for all other CPUs (unaffected or unpatched).
This looks like a performance tradeoff decision (as e.g. many operating systems load updated microcode automatically even if the bios is not updated) between affected and unaffected CPUs to me.
Is there data to understand how much this helps affected CPUs vs how much it will cause a performance regression on unaffected CPUs?
Where does this effect come from? We're wasting the same icache space in either case (padding with prefixes or with no-ops). Is it the decoded instruction cache? In any case, we should measure to see how much worse NOPs are than prefixes.
A few percent here is not a big deal.
If the performance characteristics of multiple prefixes vary a lot by processor, I'd rather use the safe number (3?) and do larger padding with NOPs. That way performance is predictable, and we don't need subarch variants.
This is my sticking point. This patch will only be less useful over time (assuming Intel has stopped selling the bad chips), and it's not clear even now whether it is a net win. Answers to @knweiss and @martisch 's comments would help here.
I think we need to handle jumps like this correctly. Compute the target of the jump without padding, and adjust the jump if padding is inserted between the jump and target.
I don't think this is correct. As long as the
Thanks for the good discussion so far. Two small things:
Note that JMP 8(PC) does not mean 8 bytes in the text segment. It means 8 instructions forward in the assembly file. Those targets are resolved fairly early in the assembler. You'd probably have to go out of your way to break that when inserting padding into the encoding of individual instructions. (But please don't of course.)
Keith is also correct about a GOAMD64 change causing a complete rebuild when the cache only has objects with different GOAMD64 settings, just as you'd hope.
@randall77 Regarding NOPs:
“I don't have any numbers myself. I was only involved in some of the code review internally. My understanding is that NOP instructions would place extra nop uops into the DSB (the decoded uop buffer) and that limits the performance that can be recovered. By using redundant prefixes no extra uops are generated and more performance is recovered.” Source
Some more thoughts on the topic:
Padding and binary size:
Architecture specific padding:
Adding GOAMD64 options:
Maintenance of new compiler options:
Pseudo assembler doing more magic:
@knweiss This is essentially correct. The Go patch is designed to compensate for the performance effects of the microcode update on Go programs. See section 2.4 of the white paper for more details.
@randall77 There's some additional information about the use of NOPs on Intel hardware in the section "Using NOPs" of the Intel® 64 and IA-32 Architectures Optimization Reference Manual. I will try to gather some data on prefixes vs NOPs in the context of this patch.
Ah, this is good news.
Phoronix has some benchmarks for GCC https://www.phoronix.com/scan.php?page=article&item=intel-jcc-microcode
It seems like adding padding can also further lower performance in benchmarks for affected CPUs with new microcode.
Here are some benchmarks results that illustrate the effects of the microcode update and the software mitigation on Go programs.
Unless otherwise noted, all benchmarks were generated with either a local build of master at #99957b6 or with version 3 of the patch applied on top of #99957b6.
Effects of the Go Software Mitigation on Build Time and Binary Size
In our tests, the software mitigation can slow the Go compiler down by between 0 and 8%. Results for the Go compiler benchmark suite are presented below. These results slightly favor the “new-time/op” column (generated by the patched Go compiler), as they were run on a machine on which the microcode update had been applied. The table shows the increase in time taken to compile a set of packages from the Go standard library.
In our tests, Go binaries built with the software mitigation show size increases of between 1.4% and 2.0%. Here are some examples:
Effects of the Go software mitigation on Go benchmarks
The effect of the Go software mitigation on the performance of generated binaries on machines without the microcode update applied is shown below.
Effects of the Microcode Update on Go benchmarks
Most benchmarks fall into the 0-4% range. However, there are outliers.
Effectiveness of the Go Software Mitigation
Here we see benchmarks compiled without the Go software mitigation vs benchmarks compiled with the software mitigation on a machine with the new microcode update.
Here we compare the results of the benchmark suite built with an unpatched Go compiler running on old microcode versus benchmarks compiled with the software mitigation and run on the new microcode.
Comparing maximum of 5 prefixes to NOPs only
The next set of benchmarks were all run on the test machine with the microcode update. The results for the second column were generated by the Go software mitigation that permits a maximum of 5 prefixes per instruction. The results for the third column were generated by a version of this mitigation that uses only NOPs to pad jumps.
In the end I couldn't resist the challenge and I did try to break JMP n(PC), but predictably, I failed. I contrived a small assembly language function with a JMP 2(PC) instruction that was directly followed by an unconditional jump. If the JMP 2(PC) instruction works correctly, this second jump instruction should be skipped. NOPs were used at the start of the function to ensure that the second jump instruction ended on a 32 byte boundary and would then require padding with the patch enabled.
Here's what the disassembly looks like with the patch disabled.
And here's what it looks like with the patch enabled.
Note that there is now a 2 byte NOP between the two jump instructions and that the second jump starts at a new 32 byte boundary. Note also that the target of the first jump, the JMP 2(PC) instruction, has correctly been incremented by 2 bytes (it's encoded as eb 04 instead of eb 02), so the second jump is still skipped and everything works as expected.
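The displacement fix-up just described can be sketched as follows (hypothetical helper working on raw byte offsets; the real assembler adjusts targets on its internal instruction representation): padding inserted between the end of a relative jump and its target grows the displacement by the same amount.

```go
package main

import "fmt"

// adjustDisp returns the new displacement for a forward relative jump
// ending at byte offset jumpEnd, after one padding byte is inserted at
// each offset in pads (which must be sorted in ascending order).
func adjustDisp(jumpEnd, disp int, pads []int) int {
	target := jumpEnd + disp
	for _, p := range pads {
		if p >= jumpEnd && p < target {
			disp++
			target++ // the target shifts forward along with the code
		}
	}
	return disp
}

func main() {
	// The JMP 2(PC) example above: a 2-byte NOP inserted immediately
	// after the jump shifts its target by 2 (eb 02 becomes eb 04).
	fmt.Println(adjustDisp(10, 2, []int{10, 11})) // 4
	// Padding inserted beyond the target leaves the jump alone.
	fmt.Println(adjustDisp(10, 2, []int{20})) // 2
}
```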
The only way I could get this to fail was to replace the JMP 2(PC) instruction with the byte sequence of a positive relative jump of two bytes directly in the assembly code, e.g.,
Note this is the encoding of the first jump in the first example. When I did this I found that the second jump was skipped when the patch was disabled but executed when the patch was enabled, as the first jump instruction was not updated. With the patch enabled we then skip only the inserted 2 byte NOP instead of the NOP and the second jump. I can't imagine why anyone would want to write assembly code like this, given we have labels and JMP 2(PC), or even whether this sort of thing is officially supported. I mention it, however, for completeness.
@rsc I've updated the patch to use GOAMD64 and uploaded some benchmark data that show the effects of the patch with and without the microcode update. I also double-checked to make sure that the patch doesn't break JMP n(PC) instructions, which it doesn't, although it can be fooled by jumps inserted into assembly code using BYTE statements. Is there now enough information available to make a decision on this bug?
The additional benchmark I'd like to see is before and after the Go padding CL, on a chip that doesn't have the underlying bug. On an AMD chip, for example.
Some more info:
The padding buys us back 1.2% in performance for a cost of 0.5% code size.
This set of benchmarks shows prefixes are better than nops by 0.3-0.5%. They are paying 2-3% in code size.
In all of the LLVM discussion, I didn't see any mention of performance cost on non-affected chips. Maybe that's something they don't have to deal with, but we do.
The Go benchmarks posted by @markdryan show a performance improvement of 3.5% for a space cost of 1.4-2%. That seems quite a bit better than the C benchmarks were demonstrating. Not sure why that would be. (Maybe Go code is more branchy?)
My opinion is we should just do nothing. This is tricky (see all the discussion in here about how this breaks debuggers in various ways), and the performance deltas just aren't that large.
I have re-run the benchmarks on a Haswell Macbook (Core i7-4870HQ), the results of which I will post below. Some benchmarks are also available here although these were generated using an earlier version of the patch and it’s not clear what CPU was used. Finally, the second set of benchmarks posted above show the effect of the Go software mitigation on a machine which is affected by the Erratum but does not have the microcode update applied.
It’s important to note that the software patch doesn’t just reduce the impact of the microcode update on performance. It also reduces the impact of the microcode update on consistency of performance from one build to another. Consider this scenario: function “A” contains a macro-fused jump that implements a tight loop. This jump does not cross or end on a 32-byte boundary and incurs no performance penalty on a machine that is affected by the Erratum and that is running the microcode update. A new function, “B”, is then added to the program. B is not related to A in any way: it is not invoked directly or indirectly by A, nor does it invoke A. The linker, however, happens to place B before A in the final binary. This changes the alignment of A, such that its macro-fused jump now crosses a 32-byte boundary, which in turn may degrade function A’s performance. Relatedly, without the software mitigation, performance effects such as this may accompany updates to a development environment (e.g., a compiler update).
This is not just a hypothetical concern. Experiments conducted with the Revcomp (Go) benchmark demonstrate this variability, and more simply, how to prevent it. In the benchmark data provided above, you will see that Revcomp built at commit #99957b6 and tested on a system with the microcode update suffered a 19% performance regression. However, this loss can be eliminated with an unpatched compiler simply by adding a few functions to the benchmark’s executable. Since adding (or removing) functions unrelated to the Revcomp function (i.e. not invoked by Revcomp) can change its alignment (and thus benchmark performance), software mitigation is needed on machines affected by the Erratum to counter this effect of the microcode update on code performance predictability.
Therefore, without a software mitigation it could be more difficult to rely on benchmark data for comparative purposes when this data is generated on machines affected by the Erratum that have the microcode update applied. One won’t know whether performance gains from a new optimization are due to skillful programming or because added code happens to have displaced a critical jump, given the potentially significant run-to-run variability.
What is the specific issue with debuggers? I did follow the link but didn’t see any direct discussion of debuggers, although I am not familiar with LLVM so I may have missed it. Is this an issue with prefix padding, NOP padding, or both? I could provide a version of the patch that uses only NOPs to pad jumps. Note that the NOP-only patch, like the prefix padding patch, relies on an existing mechanism in the assembler for inserting NOPs into the code stream, although it doesn't seem to be used currently (since the NaCl code was removed).
Impact of the Go software mitigation on a machine unaffected by the Erratum
Impact of prefix padding on machines unaffected by the Erratum
The following tables show the effect of the Go software mitigation that uses prefixes to pad jumps on a machine that is not affected by the Erratum. The second column in the tables below represents the benchmark results achieved using a build of master at commit #99957b6. The third column shows the results obtained when the software mitigation is applied on top of commit #99957b6. The patch reduced the geomean run time for the time based tests by 0.24% (range -6.9% to +2.68%) and lowered the geomean throughput for the throughput tests by 0.21% (range -2.06% to 0%).
Impact of NOP padding on machines unaffected by the Erratum
The following tables show the effect of the Go software mitigation that uses only NOPs to pad jumps on a machine that is not affected by the Erratum. The second column in the tables below represents the benchmark results achieved using a build of master at commit #99957b6. The third column shows the results obtained when the NOP-only version of the software mitigation is applied on top of commit #99957b6. The patch increased the geomean run time for the time based tests by 0.29% (range -6.25% to +4.85%) and lowered the geomean throughput for the throughput tests by 0.80% (range -4.58% to +0.80%).
As far as I understand, cache alignment/loop stream decoder alignment effects don't only happen for non-padded jumps (we have had this happen on earlier versions of Go and on unaffected chips already), and aligning functions on 32-byte boundaries instead of 16 bytes alone would solve many of these without requiring padded jumps. Just aligning functions on 32-byte boundaries generally sounds like a good idea and is likely to have positive effects even on chips not affected by the erratum.
Which leaves the remaining question of whether padded jumps are needed in addition.
This is a general problem with Go on x86 processors. We frequently see changes in performance from one build to another. It would be nice to address that in general, but I'm a bit skeptical that this particular approach will fix all such problems.
The problem of varying performance, from build to build, is likely to get worse on machines affected by the Erratum, once the microcode update is applied. The patch discussed in this issue doesn’t attempt to solve the general problem of varying performance. It only attempts to mitigate against the effect of the microcode update on build to build benchmark consistency on machines affected by the Erratum. I have changed the wording of my previous post to better reflect this.
Aligning functions on 32 byte boundaries would prevent the build to build consistency problems caused by the microcode update on machines affected by the Erratum in the one scenario I described above. However, this scenario is just one example of how the microcode update can affect both performance, and consistency of performance, from one build to another. There are others. A seemingly innocuous change to a function or a minor update to a user’s development environment, that pulls in a small update ( to an inlinable function, for example) could change the alignment of jumps in the user’s code. On machines affected by the Erratum and running the microcode update, there’s an increased risk that such changes will have a performance effect. Padding is required on those machines to mitigate against this increased risk.
I agree that the effects you describe can happen and cause performance regressions. I think on the other side there are also effects that happen just because of the padding. A jump that might previously have been aligned is now misaligned. Code that previously would have fit into x cache lines now takes x+1 cache lines. So there are performance regressions that happen purely due to the padding. The tradeoff, then, is between adding the padding (with its extra options and code) and accepting the regressions it causes, versus not padding and having regressions in other places.
If we are purely going to do it to stabilize performance then it should not be an option but the default, as that is what I expect the majority of Go users to run. Otherwise, if users turn the option on and off to see the performance effect, they will also measure the effects of code containing no padded jumps now being aligned differently. However, turning on the option for all CPUs also applies padding to CPUs that don't need it and changes their performance (for better or worse).
Just so everyone knows, the penalty can be quite bad, as reported in #37190.
I reproduced this,
Note that alignment improves the best case, so the best-to-worst slowdown exceeds 100% when things line up just so.
For that set of benchmarks (excluding those affected by a not-fully-mitigated MacOS bug):
The benchmarks were those in https://github.com/dr2chase/bent
In another benchmark run, I also checked the size and performance costs of 16 vs 32-byte alignment; we want 32-byte alignment, 16 gives 0.7% bigger text and 0.82% slower geomean execution, with almost no winners in the run-time column.
For reference, the two benchmark configurations:
and git diff in go:
I think we should be looking at the NOP-only patch and probably just have it turned on all the time. This seems like the least-risk way of avoiding this sometimes-terrible slowdown that will also interfere with performance-tuning work on updated-microcode Intel processors.