Intel erratum SKX102 “Processor May Behave Unpredictably Under Complex Sequence of Conditions Which Involve Branches That Cross 64-Byte Boundaries” applies to:
There is a microcode fix that can be applied by the BIOS to avoid the incorrect execution. It stops any jump instruction (JMP, Jcc, CALL, RET; direct or indirect) from being cached in the decoded icache when the instruction ends at or crosses a 32-byte boundary. Intel says:
The suggested workaround for the microcode update's slowdown is to insert padding so that fused branch sequences never end at or cross a 64-byte boundary. This means padding the whole CMP+Jcc pair, not just the Jcc.
CL 206837 adds a new environment variable to set the padding policy. The original CL used $GO_X86_PADJUMP but the discussion has moved on to using $GOAMD64, which would avoid breaking the build cache.
There are really two questions here:
In general, we try to do the right thing for developers so that they don't have to keep track of every last CPU erratum. That seems like it would suggest we should do the padding automatically. Otherwise Go programs on this very large list of processors have the possibility of behaving “unpredictably.”
If the overheads involved are small enough and we are 100% confident in the padding code, we could stop there and just leave it on unconditionally. It seems like that's what we should do rather than open the door to arbitrary compiler option configuration in $GOAMD64, and all the complexity that comes with it.
So what are the overheads? Here is an estimate.
This go command has 10,362 functions, and the padding required for instructions crossing or ending at a 16-byte boundary should average out to 251,848 extra bytes. The awk adds 3 to conditional jumps to simulate fusing of a preceding register-register CMP instruction.
Changing function alignment to 32 bytes would halve the padding added (saving 125,924 bytes) but add 16 more bytes on average to each of the functions (adding 165,792 bytes). So changing function alignment does not seem to be worthwhile.
Same for a smaller binary:
Changing alignment to 32 would save 29,318.4 bytes but add 47,296 bytes.
Overall, the added bytes are 1.67% in both the go command and gofmt. This is not nothing, but it seems like a small price to pay for correct execution, and if it makes things faster on some systems, even better.
My tentative conclusion then would be that we should just turn this on by default and not have an option. Thoughts?
@rsc thank you for entering the bug and summarising the issue.
The assembler patches work by ensuring that neither standalone nor macro-fused jumps end on or cross 32 byte boundaries, not 64.
This is a difficult question to answer, as there isn't a single value that is optimal across all architectures. The white paper notes that the current default of 5 may not be optimal for some Atom processors, which are not affected by the erratum. These processors may take longer to decode instructions that have more than 3 or 4 prefixes.
On the other hand, a default value of 3 is sub-optimal for processors affected by the erratum, as it means there is less prefix space available for padding. If the instructions that precede a jump do not have sufficient prefix space available to pad that jump, the patch pads the jump with NOPs instead, which are less efficient.
I modified the patch locally so that it retains the existing 16 byte function alignment but pads jumps so that they do not end on or cross 16 byte boundaries. I found that this actually increases binary size slightly over the original patch which uses 32 byte alignment. So for the Go binary I see binary sizes of 15551238 (16 byte alignment) vs 15436550 (32 byte alignment). For go1.test I see 11800358 (16 byte alignment) vs 11722534 (32 byte alignment).
What is more worrying, however, is that I see many more NOPs in the code stream when using 16-byte alignment. So for go1.test, padding with 16 bytes yields 25001 NOPs (of varying size) whereas padding with 32 bytes yields only 12375. This stands to reason really. The patch cannot always use prefixes to pad jumps, and when prefixes can't be used, it falls back to NOPs. If we reduce function alignment to 16 we'll increase the number of jumps that need to be padded and, most likely, the number of NOPs that need to be inserted. This is likely to hurt performance.
I think the concerns about enabling the patch by default are:
On the other hand, another issue with not having the patch enabled by default is that, to really take advantage of it, you would need to build the Go toolchain yourself; otherwise the standard library functions that your program links against would presumably not be compiled with the mitigation.
@rsc Correct me if I'm wrong but my understanding is that there are two ways to fix the unpredictable behavior caused by this CPU erratum:
I.e., the padding is actually not required if Go programs can assume they run under the latest microcode versions. In this case the padding is "just" a performance optimization for the affected CPUs with the latest microcode fixes.
@knweiss your text also aligns with my understanding: if the affected CPU has updated microcode, then padding is only a performance improvement for that affected, microcode-updated CPU, while adding the padding costs some performance (decode bandwidth, icache usage) on all other CPUs (unaffected or unpatched).
This looks like a performance tradeoff decision (as e.g. many operating systems load updated microcode automatically even if the bios is not updated) between affected and unaffected CPUs to me.
Is there data to understand how much this helps affected CPUs vs how much this will cause a performance regression in not affected CPUs?
Where does this effect come from? We're wasting the same icache space in either case (padding with prefixes or with no-ops). Is it the decoded instruction cache? In any case, we should measure to see how much NOPs are worse than prefixes.
A few percent here is not a big deal.
If the performance characteristics of multiple prefixes vary a lot by processor, I'd rather use the safe number (3?) and do larger padding with NOPs. That way performance is predictable, and we don't need subarch variants.
This is my sticking point. This patch will only be less useful over time (assuming Intel has stopped selling the bad chips), and it's not clear even now whether it is a net win. Answers to @knweiss and @martisch 's comments would help here.
I think we need to handle jumps like this correctly. Compute the target of the jump without padding, and adjust the jump if padding is inserted between the jump and target.
I don't think this is correct. As long as the
Thanks for the good discussion so far. Two small things:
Note that JMP 8(PC) does not mean 8 bytes in the text segment. It means 8 instructions forward in the assembly file. Those targets are resolved fairly early in the assembler. You'd probably have to go out of your way to break that when inserting padding into the encoding of individual instructions. (But please don't, of course.)
Keith is also correct about a GOAMD64 change causing a complete rebuild when the cache only has objects with different GOAMD64 settings, just as you'd hope.
@randall77 Regarding NOPs:
“I don't have any numbers myself. I was only involved in some of the code review internally. My understanding is that NOP instructions would place extra nop uops into the DSB (the decoded uop buffer) and that limits the performance that can be recovered. By using redundant prefixes no extra uops are generated and more performance is recovered.” Source
Some more thoughts on the topic:
Padding and binary size:
Architecture specific padding:
Adding GOAMD64 options:
Maintenance of new compiler options:
Pseudo assembler doing more magic:
@knweiss This is essentially correct. The Go patch is designed to compensate for the performance effects of the microcode update on Go programs. See section 2.4 of the white paper for more details.
@randall77 There's some additional information about the use of NOPs on Intel hardware in the section "Using NOPs" of the Intel® 64 and IA-32 Architectures Optimization Reference Manual. I will try to gather some data on prefixes vs NOPs in the context of this patch.
Ah, this is good news.
Phoronix has some benchmarks for GCC: https://www.phoronix.com/scan.php?page=article&item=intel-jcc-microcode
It seems like adding padding can also further lower performance in benchmarks for affected CPUs with new microcode.