Clear destination registers before sqrt instruction on amd64 #9041

Merged (2 commits) on Oct 15, 2019

Conversation

stedolan
Contributor

This is a tiny patch to the amd64 code emitter (asmcomp/amd64/emit.mlp) which adds a seemingly useless instruction before sqrtsd and makes this microbenchmark ~3x faster (compiled with ocamlopt -unsafe):

let xs = [| 42.; 42.; 42.; 42.; 42.; 42.; 42.; 42.; 42.; 42. |]
let go () = for i = 0 to 9 do xs.(i) <- sqrt xs.(i) done
let () = for i = 0 to 10_000_000 do go () done

This result is for an Intel Skylake, but I've seen a similar difference on several other Intel processors.


The sqrtsd instruction operates on the SSE vector registers, which are 128 bits wide and can store two 64-bit doubles. OCaml uses the SSE vector registers to implement float arithmetic, using the scalar instructions (addsd, mulsd, sqrtsd, etc.) to operate only on the lower half of the register.

Most of the scalar operations leave the upper half of the register unchanged. This causes a dependency-tracking issue for unary operations: the operand that OCaml thinks is purely a destination register is also a source register (for a part of the register that OCaml doesn't use).
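
The hidden source operand can be made explicit with a toy model. The sketch below is only an illustration, not compiler code: the `xmm` record and the function names are hypothetical, and each function returns the new contents of the destination register.

```ocaml
(* Toy model of a 128-bit SSE register holding two doubles.
   The type and function names are hypothetical, for illustration only. *)
type xmm = { lo : float; hi : float }

(* addsd src, dst: both operands are genuinely read. *)
let addsd ~src ~dst = { lo = dst.lo +. src.lo; hi = dst.hi }

(* sqrtsd src, dst: only src.lo feeds the result, but the old upper half
   of dst is preserved, so dst is an input as well as an output. *)
let sqrtsd ~src ~dst = { lo = sqrt src.lo; hi = dst.hi }

(* xorpd dst, dst: the result is all zeroes whatever dst held before. *)
let xorpd_self (_ : xmm) = { lo = 0.0; hi = 0.0 }

let () =
  let src = { lo = 16.0; hi = 7.0 } in
  let old_dst = { lo = 1.0; hi = 99.0 } in
  let r = sqrtsd ~src ~dst:old_dst in
  (* r.hi comes from old_dst, which is the dependency OCaml never asked for. *)
  Printf.printf "lo = %g, hi = %g\n" r.lo r.hi
```

In this model `sqrtsd` cannot produce its result without `dst`, even though OCaml only ever looks at `lo`.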

This means that the processor's dependency tracking and out-of-order execution get confused: when it sees sqrtsd %xmm0, %xmm1 it will stall the instruction until the previous value of %xmm1 has been computed, even though this should be an independent operation.

This is known as a partial register stall and has a standard fix: zero the entire destination register before the offending instruction, so that the processor does not see a dependency to some previous instruction (or in the benchmark above, to the same instruction on a previous iteration of the loop).

The register-zeroing is done with a dependency-breaking idiom (in this case, xorpd of a register with itself). These idioms are special instruction sequences that the processor recognises as not depending on the previous value of the register.
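
The effect on the dependency chain can be sketched with a very rough scheduling model. Everything here is an assumption made for illustration: the `instr` record, the register numbering, and the latencies (which are not measured values). The model lets an instruction start as soon as the registers it reads are ready and ignores throughput limits, so it overstates the real gain.

```ocaml
(* Idealised out-of-order model: an instruction starts as soon as every
   register it reads is ready. Throughput limits are ignored. *)
type instr = { reads : int list; writes : int; latency : int }

(* Cycle at which the last result of [prog] becomes available. *)
let critical_path prog =
  let ready = Array.make 16 0 in
  List.fold_left
    (fun finish i ->
      let start = List.fold_left (fun t r -> max t ready.(r)) 0 i.reads in
      let fin = start + i.latency in
      ready.(i.writes) <- fin;
      max finish fin)
    0 prog

let sqrt_latency = 18 (* assumed, illustrative only *)

(* movsd (mem), %xmm0: a load writes the whole register and reads none. *)
let load = { reads = []; writes = 0; latency = 4 }

(* sqrtsd %xmm0, %xmm1: reads xmm0 and, for the upper half, xmm1. *)
let sqrtsd = { reads = [ 0; 1 ]; writes = 1; latency = sqrt_latency }

(* xorpd %xmm1, %xmm1: recognised by the processor as reading nothing. *)
let xorpd = { reads = []; writes = 1; latency = 0 }

let repeat n body = List.concat (List.init n (fun _ -> body))

let () =
  let n = 10 in
  Printf.printf "without xorpd: %d cycles\n"
    (critical_path (repeat n [ load; sqrtsd ]));
  Printf.printf "with xorpd:    %d cycles\n"
    (critical_path (repeat n [ load; xorpd; sqrtsd ]))
```

Without xorpd, each sqrtsd waits for the previous one, so the ten square roots form a single serial chain. With xorpd, every sqrtsd depends only on its own load and the chain length stops growing with the number of iterations.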

As far as I can tell, this only affects sqrt. It doesn't matter for binary operations, as both of their operands are genuinely sources, and sqrtsd is currently the only unary SSE operation that OCaml generates.

For more on partial register stalls and dependency-breaking idioms, read sections 3.5.1.8 and 3.5.2.4 of the Intel optimisation manual.

(Thanks to Andrew Hunter, Will Hasenplaugh, Spiros Eliopoulos and Brian Nigito for help figuring out what was going on here)

@@ -786,8 +786,11 @@ let emit_instr fallthrough i =
| Lop(Ispecific(Ibswap _)) ->
assert false
| Lop(Ispecific Isqrtf) ->
if arg i 0 <> res i 0 then
Contributor

I know nothing about the topic, but from your description, it's not obvious to me why clearing the destination register couldn't be beneficial when it is the same as the argument.

Contributor Author

Because that would result in zero, regardless of the input value! (I had this bug in the first version of this patch)

Contributor

And is there no "dependency-breaking idiom" for zeroing the upper half only?

Contributor Author

There doesn't need to be. sqrtsd %xmm0, %xmm0 does depend on the previous value of %xmm0. It is correct for the processor to wait for it to be computed.
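
The rule discussed in this thread can be sketched as follows. This is a hypothetical helper, not the code in emit.mlp: registers are plain strings and the function returns the assembly lines it would emit, in AT&T syntax.

```ocaml
(* Hypothetical helper, not the code in emit.mlp: registers are plain
   strings and the "emitter" just returns AT&T-syntax assembly lines. *)
let emit_sqrt ~arg ~res =
  let sqrt_line = Printf.sprintf "sqrtsd %s, %s" arg res in
  if arg <> res then
    (* Distinct destination: zero it first to break the false dependency. *)
    [ Printf.sprintf "xorpd %s, %s" res res; sqrt_line ]
  else
    (* Same register: zeroing would destroy the input, and the dependency
       on the old value is genuine anyway. *)
    [ sqrt_line ]

let () =
  List.iter print_endline (emit_sqrt ~arg:"%xmm0" ~res:"%xmm1");
  List.iter print_endline (emit_sqrt ~arg:"%xmm0" ~res:"%xmm0")
```

The first call yields an xorpd followed by the sqrtsd; the second yields the sqrtsd alone.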

Contributor

@xavierleroy left a comment

Looks good to me.

I'm glad sqrt is the only operation that needs dependency breaking, because those xorpd instructions increase code size.
