Clear destination registers before sqrt instruction on amd64 #9041

stedolan · 2019-10-15T14:10:46Z

This is a tiny patch to the amd64 backend (asmcomp/amd64/emit.mlp) which adds a seemingly useless instruction before sqrtsd, and makes this microbenchmark ~3x faster (compiled with ocamlopt -unsafe):

let xs = [| 42.; 42.; 42.; 42.; 42.; 42.; 42.; 42.; 42.; 42. |]
let go () = for i = 0 to 9 do xs.(i) <- sqrt xs.(i) done
let () = for i = 0 to 10_000_000 do go () done

This result is for an Intel Skylake, but I've seen a similar difference on several other Intel processors.

The sqrtsd instruction operates on the SSE vector registers, which are 128 bits wide and can store two 64-bit doubles. OCaml uses the SSE vector registers to implement float arithmetic, using the scalar instructions (addsd, mulsd, sqrtsd, etc.) to operate only on the lower half of the register.

Most of the scalar operations leave the upper half of the register unchanged. This causes a dependency-tracking issue for unary operations: the operand that OCaml thinks is purely a destination register is also a source register (for a part of the register that OCaml doesn't use).
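To make the hidden input concrete, here is a minimal value-level sketch of what a scalar unary operation does to its destination. The `xmm` record and the `sqrtsd` function are hypothetical names invented for this illustration; they model only the two 64-bit halves of a register and are not part of the compiler.

```ocaml
(* Toy model of a 128-bit SSE register holding two doubles.
   The type and function names are hypothetical, for illustration only. *)
type xmm = { lo : float; hi : float }

(* Models `sqrtsd src, dst`: only the low half of dst is written.
   The high half is carried over from the old dst, so dst is an input too. *)
let sqrtsd ~src ~dst = { lo = sqrt src.lo; hi = dst.hi }

let () =
  let src = { lo = 16.0; hi = 1.0 } in
  let old_dst = { lo = 7.0; hi = 99.0 } in
  let r = sqrtsd ~src ~dst:old_dst in
  (* The low half comes from src, the high half from the old destination. *)
  assert (r.lo = 4.0 && r.hi = 99.0)
```

OCaml never reads the `hi` field, but in this model the result still cannot be formed without the old destination value.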

This confuses the processor's dependency tracking and out-of-order execution: when it sees sqrtsd %xmm0, %xmm1, it stalls the instruction until the previous value of %xmm1 has been computed, even though this should be an independent operation.
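The cost of that stall can be sketched with a toy timing model. Everything here is an assumption made for illustration: the `latency` and `interval` values are placeholders rather than measurements, the function name is hypothetical, and the model ignores everything a real core does apart from one execution unit and one destination register.

```ocaml
(* Toy timing model, not a simulator of any real core. The latency and
   issue interval below are placeholder values chosen for illustration. *)
let latency = 18   (* cycles from the start of a sqrt to its result (assumed) *)
let interval = 6   (* minimum cycles between two sqrt starts (assumed) *)

(* Cycle at which the last of [n] back-to-back `sqrtsd %xmm0, %xmm1`
   finishes. With [false_dep], each sqrt also waits for the previous
   write to %xmm1. The source %xmm0 is taken to be ready at cycle 0. *)
let finish_time ~false_dep n =
  let rec go i unit_free dst_ready =
    if i = n then dst_ready
    else
      let start = if false_dep then max unit_free dst_ready else unit_free in
      go (i + 1) (start + interval) (start + latency)
  in
  go 0 0 0

let () =
  Printf.printf "with false dependency: %d cycles\n"
    (finish_time ~false_dep:true 10);
  Printf.printf "independent:           %d cycles\n"
    (finish_time ~false_dep:false 10)
```

In this model the chained version serialises on the full latency of each sqrt, while the independent version is limited only by how often the unit can start a new one. The actual ratio depends on the microarchitecture.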

This is known as a partial register stall and has a standard fix: zero the entire destination register before the offending instruction, so that the processor does not see a dependency on some previous instruction (or, in the benchmark above, on the same instruction in a previous iteration of the loop).

The register is zeroed with a dependency-breaking idiom (in this case, xorpd): a special instruction sequence that the processor recognises as not depending on the previous value of the register.
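At the level of values, the effect of the idiom can be sketched as follows. The `xmm` type and the three functions are hypothetical names for this illustration only. The point is that once the destination has been cleared, the result is the same whatever the destination held before.

```ocaml
(* Toy model; all names are hypothetical, for illustration only. *)
type xmm = { lo : float; hi : float }

(* `sqrtsd src, dst`: writes the low half, keeps the old high half of dst. *)
let sqrtsd ~src ~dst = { lo = sqrt src.lo; hi = dst.hi }

(* `xorpd r, r`: the result is all zeroes regardless of what r held. *)
let xorpd_self (_ : xmm) = { lo = 0.0; hi = 0.0 }

(* The patched sequence: clear dst, then take the square root into it. *)
let sqrt_into_cleared ~src ~dst = sqrtsd ~src ~dst:(xorpd_self dst)

let () =
  let src = { lo = 42.0; hi = 0.0 } in
  let a = sqrt_into_cleared ~src ~dst:{ lo = 1.0; hi = 2.0 } in
  let b = sqrt_into_cleared ~src ~dst:{ lo = nan; hi = infinity } in
  (* Two very different old destinations give identical results. *)
  assert (a = b)
```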

As far as I can tell, this only affects sqrt. It doesn't matter for binary operations, as both of their operands are genuinely sources, and sqrtsd is currently the only unary SSE operation that OCaml generates.

For more on partial register stalls and dependency-breaking idioms, read sections 3.5.1.8 and 3.5.2.4 of the Intel optimisation manual.

(Thanks to Andrew Hunter, Will Hasenplaugh, Spiros Eliopoulos and Brian Nigito for help figuring out what was going on here)

alainfrisch · 2019-10-15T14:15:43Z

asmcomp/amd64/emit.mlp

@@ -786,8 +786,11 @@ let emit_instr fallthrough i =
  | Lop(Ispecific(Ibswap _)) ->
      assert false
  | Lop(Ispecific Isqrtf) ->
+      if arg i 0 <> res i 0 then


I know nothing about the topic, but from your description, it's not obvious to me why clearing the destination register couldn't be beneficial when it is the same as the argument.

stedolan · 2019-10-15

Because that would result in zero, regardless of the input value! (I had this bug in the first version of this patch)

mlasson · 2019-10-15

And there's no "dependency breaking idiom" for zeroing the upper half only?

stedolan · 2019-10-15

There doesn't need to be. sqrtsd %xmm0, %xmm0 does depend on the previous value of %xmm0. It is correct for the processor to wait for it to be computed.
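The aliasing case discussed in this thread can be checked with a small sketch. This is not the code in emit.mlp: the `instr` type, `emit_sqrt` and `run` are hypothetical, only the low halves of registers are modelled, and the `guard` flag exists purely to compare the two behaviours.

```ocaml
(* Toy instruction set; all names are hypothetical, for illustration only. *)
type instr =
  | Zero of string             (* stands for xorpd %r, %r *)
  | Sqrtsd of string * string  (* sqrtsd %src, %dst *)

(* Emit res <- sqrt arg. With [guard], the zeroing is skipped when the
   argument and result registers coincide. *)
let emit_sqrt ~guard ~arg ~res =
  if guard && arg = res then [ Sqrtsd (arg, res) ]
  else [ Zero res; Sqrtsd (arg, res) ]

(* Run instructions over an association list of register low halves. *)
let run regs instrs =
  List.fold_left
    (fun regs instr ->
      match instr with
      | Zero r -> (r, 0.0) :: List.remove_assoc r regs
      | Sqrtsd (src, dst) ->
          (dst, sqrt (List.assoc src regs)) :: List.remove_assoc dst regs)
    regs instrs

let () =
  let regs = [ ("xmm0", 42.0) ] in
  let unguarded = run regs (emit_sqrt ~guard:false ~arg:"xmm0" ~res:"xmm0") in
  let guarded = run regs (emit_sqrt ~guard:true ~arg:"xmm0" ~res:"xmm0") in
  (* Zeroing an aliased register destroys the operand: sqrt 0 = 0. *)
  assert (List.assoc "xmm0" unguarded = 0.0);
  assert (List.assoc "xmm0" guarded = sqrt 42.0)
```

When the registers differ, both variants emit the zeroing followed by the square root, so the guard only changes the aliased case.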

xavierleroy

Looks good to me.

I'm glad sqrt is the only operation that needs dependency breaking, because those xorpd instructions increase code size.

alainfrisch reviewed Oct 15, 2019

xavierleroy approved these changes Oct 15, 2019

Clear destination registers before sqrt instruction on amd64.

4a48ac6

This avoids a partial register stall.

stedolan force-pushed the sse-partial-register-stall branch from 903d502 to 4a48ac6 Compare October 15, 2019 15:03

Update Changes

0f87fea

xavierleroy merged commit 71f3ec4 into ocaml:trunk Oct 15, 2019

gretay-js mentioned this pull request Oct 10, 2023

Refactor x86-specific float64 intrinsics ocaml-flambda/flambda-backend#1878

Merged

TheNumbat mentioned this pull request Oct 10, 2023

Avoid partial register stall in roundsd ocaml-flambda/flambda-backend#1923

Merged

stedolan commented Oct 15, 2019

alainfrisch Oct 15, 2019

stedolan Oct 15, 2019

mlasson Oct 15, 2019

stedolan Oct 15, 2019

xavierleroy left a comment
