Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize 32->64 sign-extension for AMD64 #1631

Merged
merged 3 commits into from May 28, 2018

Conversation

Projects
None yet
4 participants
@LemonBoy
Copy link
Contributor

commented Feb 25, 2018

Before:

48 c1 e3 20            shl    rbx,0x20
48 c1 fb 20            sar    rbx,0x20

After:

48 63 db               movsxd rbx,ebx
-- or, if the value is in eax --
48 98                  cdqe

The same can be done for AArch64 using the sxtw op.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Feb 27, 2018

Can the difference be observed (even on a micro-benchmark)?

@LemonBoy

This comment has been minimized.

Copy link
Contributor Author

commented Feb 28, 2018

A completely unscientific benchmark using the following code and using time gives the following results:

Version Time
trunk 2.828
trunk+flambda 3.944
trunk+PR 1.714
trunk+PR+flambda 2.273

The speedup and the reduction of code size is proportional to the operations made on 32bit integers.
It may also be useful to notice that flambda produces one more (useless) sign extension:

(let Paddbint_44/1259
   (>>s (<< (+ (>>s (<< Foo_y/1257 32) 32) 1) 32) 32)
  (assign Foo_y/1257 (>>s (<< Paddbint_44/1259 32) 32)))
(loop (assign y/1235 (>>s (<< (+ (>>s (<< y/1235 32) 32) 1) 32) 32))
let  y = ref (Int32.of_int 0) in                                            
for i=0 to 1000000000 do                                                      
    y := Int32.(succ !y)
done;                                               
print_int (Int32.to_int !y);                                             
print_newline ()
| Lop(Ispecific(Isextend)) ->
let src = res i 0 in
begin match src with
| Reg64 x when x = RAX -> I.cdqe ()

This comment has been minimized.

Copy link
@alainfrisch

alainfrisch Feb 28, 2018

Contributor
| Reg64 RAX -> ...
@gasche

This comment has been minimized.

Copy link
Member

commented Feb 28, 2018

(Edit: the benchmark above has been much improved, so the message above isn't relevant anymore.)

The benchmark does not give the numbers I would be looking for:

  • what about the state of trunk just before this PR? Comparing to 4.06.1 risks measuring other trunk changes in the difference.
  • could we get the "diff" of this PR (compared to trunk) both with and without flambda?

(I'm not saying you should re-do those benchmarks, I can't judge whether the benchmarks matter at all to evaluate the PR, just commenting on the proposed numbres.)

@@ -88,7 +88,7 @@ let pseudoregs_for_operation op arg res =
(* One-address unary operations: arg.(0) and res.(0) must be the same *)
| Iintop_imm((Iadd|Isub|Imul|Iand|Ior|Ixor|Ilsl|Ilsr|Iasr), _)
| Iabsf | Inegf
| Ispecific(Ibswap (32|64)) ->
| Ispecific(Ibswap (32|64)|Isextend) ->

This comment has been minimized.

Copy link
@alainfrisch

alainfrisch Feb 28, 2018

Contributor

The movsxd would support different source and target registers, no?

This comment has been minimized.

Copy link
@LemonBoy

LemonBoy Mar 1, 2018

Author Contributor

Yes, I've lifted this restriction in the latest commit.

This comment has been minimized.

Copy link
@alainfrisch

alainfrisch Mar 1, 2018

Contributor

If there a risk that the restriction was actually useful, in that it allowed to use more often the RAX case, resulting in better performance?

This comment has been minimized.

Copy link
@LemonBoy

LemonBoy Mar 1, 2018

Author Contributor

I don't think so, the gains coming from the use of a single cdqe are often shadowed by the gains brought by generating less mov in insert_op_debug (the worst case is 2 mov reg,reg for a whopping 6 bytes)

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Feb 28, 2018

I can't judge whether the benchmarks matter at all to evaluate the PR

Personally, I cannot judge an "optimization" PR without any number. There is always a cost to review and a risk inherent to any change, and the numbers are the justification for these.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Feb 28, 2018

Cc:ing @chambart @mshinwell @lpw25 about the flambda "regression" reported above. Perhaps worth keeping track of it independently of this PR, though.

@LemonBoy

This comment has been minimized.

Copy link
Contributor Author

commented Feb 28, 2018

The benchmark does not give the numbers I would be looking for

I've re-run the test with four different compilers, the results should be slightly more meaningful now.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Feb 28, 2018

Thanks, the numbers are convincing and I'm in favor of accepting the change. I'll give a few days to let other developers comment if they wish. In the meantime, can you address my inline comments and also:

  • Please add a Changelog entry.

  • Can you confirm that both generated instructions are exercised in the testsuite, and if not, add a dedicated test?

@alainfrisch alainfrisch self-assigned this Feb 28, 2018

@LemonBoy LemonBoy force-pushed the LemonBoy:signextend branch from ed85400 to 6d09cb6 Mar 1, 2018

@LemonBoy

This comment has been minimized.

Copy link
Contributor Author

commented Mar 1, 2018

Can you confirm that both generated instructions are exercised in the testsuite, and if not, add a dedicated test?

The lib-digest/md5.ml test generates both the instructions and executes correctly.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

This is touching the backend, so let's ask the boss: @xavierleroy, no opposition from your side on this PR?

@xavierleroy
Copy link
Contributor

left a comment

Two suggestions below.

@@ -42,6 +42,7 @@ type specific_operation =
| Ibswap of int (* endianness conversion *)
| Isqrtf (* Float square root *)
| Ifloatsqrtf of addressing_mode (* Float square root from memory *)
| Isextend (* Convert value with sign extension *)
and float_operation =

This comment has been minimized.

Copy link
@xavierleroy

xavierleroy Mar 7, 2018

Contributor

The name Isextend and the comment are not precise enough. There are instructions to sign-extend 8, 16 and 32-bit quantities.

This comment has been minimized.

Copy link
@LemonBoy

LemonBoy Mar 7, 2018

Author Contributor

Is Isextend32 better?

| Lop(Ispecific(Isextend)) ->
begin match (arg i 0, res i 0) with
| (Reg64 RAX, Reg64 RAX) -> I.cdqe ()
| (_, dst) -> I.movsxd (arg32 i 0) dst

This comment has been minimized.

Copy link
@xavierleroy

xavierleroy Mar 7, 2018

Contributor

I'm skeptical that it is worth recognizing and generating the cdqe special-case instruction. Emitting movsxd in all cases would be just as efficient in practice and keep the code simpler.

This comment has been minimized.

Copy link
@LemonBoy

LemonBoy Mar 7, 2018

Author Contributor

Roger that, I'll update the PR as soon as possible.

@LemonBoy LemonBoy force-pushed the LemonBoy:signextend branch from 70796f3 to 87a8e02 May 9, 2018

@LemonBoy

This comment has been minimized.

Copy link
Contributor Author

commented May 9, 2018

Amended and rebased, sorry for the delay.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented May 11, 2018

LGTM.

Can you add @xavierleroy as a reviewer to the Changes entry?

Will merge in a few days, unless @xavierleroy or someone else objects to it.

Rename Isextend to Isextend32
Drop the cdqe optimization and always use movsxd instead.

@LemonBoy LemonBoy force-pushed the LemonBoy:signextend branch from 87a8e02 to 8c10ca1 May 12, 2018

@alainfrisch alainfrisch merged commit 86d1f0d into ocaml:trunk May 28, 2018

0 of 2 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented May 28, 2018

Merged, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.