Clear destination registers before sqrt instruction on amd64 #9041
@@ -786,8 +786,11 @@ let emit_instr fallthrough i =
     | Lop(Ispecific(Ibswap _)) ->
         assert false
     | Lop(Ispecific Isqrtf) ->
         if arg i 0 <> res i 0 then
I know nothing about the topic, but from your description, it's not obvious to me why clearing the destination register couldn't be beneficial when it is the same as the argument.
Because that would result in zero, regardless of the input value! (I had this bug in the first version of this patch)
And there's no "dependency breaking idiom" for zeroing the upper half only?
There doesn't need to be. `sqrtsd %xmm0, %xmm0` does depend on the previous value of `%xmm0`. It is correct for the processor to wait for it to be computed.
Looks good to me.
I'm glad sqrt is the only operation that needs dependency breaking, because those `xorpd` instructions increase code size.
This avoids a partial register stall.
This is a tiny patch to `amd64.S` which adds a seemingly-useless instruction before `sqrtsd`, and makes this microbenchmark ~3x faster (compiled with `ocamlopt -unsafe`):

This result is for an Intel Skylake, but I've seen a similar difference on several other Intel processors.
The `sqrtsd` instruction operates on the SSE vector registers, which are 128 bits wide and can store two 64-bit doubles. OCaml uses the SSE vector registers to implement `float` arithmetic, using the scalar instructions (`addsd`, `mulsd`, `sqrtsd`, etc.) to operate only on the lower half of the register.

Most of the scalar operations leave the upper half of the register unchanged. This causes a dependency-tracking issue for unary operations: the operand that OCaml thinks is purely a destination register is also a source register (for a part of the register that OCaml doesn't use).
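As an informal illustration of the merge behaviour described here, the sketch below models an SSE register as a pair of doubles. The `xmm` type and the `sqrtsd` and `xorpd_self` functions are hypothetical names for this model only; nothing like them exists in the compiler.

```ocaml
(* Toy model of a 128-bit SSE register: two 64-bit doubles. *)
type xmm = { lo : float; hi : float }

(* sqrtsd src, dst: writes the square root of src's low half into dst's
   low half and keeps dst's upper half, so the result depends on the
   old value of dst as well as on src. *)
let sqrtsd ~src ~dst = { lo = sqrt src.lo; hi = dst.hi }

(* xorpd r, r: zeroes the whole register, whatever it held before. *)
let xorpd_self (_ : xmm) = { lo = 0.0; hi = 0.0 }

let () =
  let src = { lo = 9.0; hi = 1.0 } and dst = { lo = 5.0; hi = 7.0 } in
  let r = sqrtsd ~src ~dst in
  (* The upper half of r comes from the old dst, not from src. *)
  Printf.printf "lo=%g hi=%g\n" r.lo r.hi
```

In this model, clearing a register that is also the source (`xorpd_self` followed by `sqrtsd` with the same value as `src` and `dst`) always yields a low half of zero, which is why the patch only clears the destination when it differs from the argument.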
This means that the processor's dependency-tracking and out-of-order execution get confused: when it sees `sqrtsd %xmm0, %xmm1` it will stall the instruction until the previous value of `%xmm1` has been computed, even though this should be an independent operation.

This is known as a partial register stall and has a standard fix: zero the entire destination register before the offending instruction, so that the processor does not see a dependency to some previous instruction (or in the benchmark above, to the same instruction on a previous iteration of the loop).
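A loop of roughly the following shape shows where such a loop-carried dependency could arise. This is a hypothetical sketch, not the microbenchmark from this PR, and `sum_sqrt` is an illustrative name. Whether the source and destination of `sqrtsd` end up in different registers depends on the compiler's instruction selection and register allocation.

```ocaml
(* Each square root here is logically independent of the others. If the
   compiler emits sqrtsd with a destination register different from its
   source, then without the fix each sqrtsd may still have to wait for
   the value left in that destination by the previous iteration. *)
let sum_sqrt (a : float array) : float =
  let acc = ref 0.0 in
  for i = 0 to Array.length a - 1 do
    acc := !acc +. sqrt a.(i)
  done;
  !acc

let () =
  let a = Array.init 1_000 float_of_int in
  Printf.printf "%g\n" (sum_sqrt a)
```

The additions into `acc` form a genuine dependency chain. The square roots do not, so ideally the processor could overlap them across iterations.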
The register-zeroing is done by a dependency-breaking idiom (in this case, `xorpd`), one of a set of special sequences recognised by the processor as not depending on the previous value.

As far as I can tell, this only affects `sqrt`. It doesn't matter for binary operations, as both of their operands are genuinely sources, and `sqrtsd` is currently the only unary SSE operation that OCaml generates.

For more on partial register stalls and dependency-breaking idioms, read sections 3.5.1.8 and 3.5.2.4 of the Intel optimisation manual.
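The rule the patch applies can be sketched as a standalone function. This is a simplified model, not the code in the compiler's emitter: `reg`, `reg_name` and `emit_sqrt` are hypothetical, and the real emitter works on its own instruction and register types.

```ocaml
(* Hypothetical stand-in for an SSE register operand. *)
type reg = Xmm of int

let reg_name (Xmm n) = Printf.sprintf "%%xmm%d" n

(* Return the instructions for a scalar square root, in AT&T syntax.
   The destination is cleared first only when it differs from the source;
   clearing it when they are the same register would destroy the input. *)
let emit_sqrt ~(src : reg) ~(dst : reg) : string list =
  let clear =
    if src <> dst then
      [ Printf.sprintf "xorpd %s, %s" (reg_name dst) (reg_name dst) ]
    else []
  in
  clear @ [ Printf.sprintf "sqrtsd %s, %s" (reg_name src) (reg_name dst) ]

let () =
  List.iter print_endline (emit_sqrt ~src:(Xmm 0) ~dst:(Xmm 1));
  List.iter print_endline (emit_sqrt ~src:(Xmm 0) ~dst:(Xmm 0))
```

When source and destination coincide, no extra instruction is emitted, since the dependency on the old register value is then a real one.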
(Thanks to Andrew Hunter, Will Hasenplaugh, Spiros Eliopoulos and Brian Nigito for help figuring out what was going on here)