OCaml Inline Assembly #162

Closed. Wants to merge 56 commits into base: trunk.

vbrankov commented Mar 31, 2015

Summary

This feature allows embedding assembler instructions within OCaml. It works and feels like inline assembly in GCC and supports almost everything that GCC supports. The main goal is to let OCaml use what modern CPUs have, like vectors, hardware FP or cache control, and to make it possible to hand-tune performance-sensitive code. The feature grew out of my interest in how to introduce new native primitives. I tried to keep the compiler changes as small as possible. It's tested on amd64, and I wrote some use cases to illustrate its usefulness; they are listed under "Examples and benchmarks".

Goals

Modern CPUs introduce hundreds of new instructions. For example, vectors alone use well over a hundred dedicated instructions. Currently, a substantial compiler patch is required to introduce a single native primitive. The slowness with which new instructions have been trickling into the language might be an illustration of the difficulties. Introducing support for new CPU instructions using inline assembly is much simpler; see for example binomial_pricer.ml. Such adaptability can make OCaml lag much less behind developments in the CPU world, or allow it to be used in new roles. Here's an overview of the intrinsics used in the Intel C compiler.

Embedded inline assembly also allows hand-tuning performance-sensitive code, since it avoids crossing from OCaml to C, which is too slow for some uses. The compiler is fully aware of the structure of the assembly code and can perform additional optimizations, for instance pulling arguments directly from memory, commuting operands or avoiding boxing. OCaml is arguably increasingly being used in places where speed is important, and this may help.

Furthermore, the design of the compiler might get simpler, since introducing native primitives would no longer require changes to the compiler. Many recent additions to the compiler could arguably have been smaller or avoided, for example "%caml_string_get/set" or "sqrt".

How to use it?

In most cases, a single line is required to create a native primitive, for example (float_round.ml):

external floor : float -> float
  = "%asm" "floor_stub" "roundsd        $1, %0, %1      # floor" "mx" "=x"

The syntax and the feature set for the most part closely follow GCC's inline assembly. GCC was used as the model because its support for inline assembly is mature. Here's a good tutorial for GCC's inline asm. The unit test comprehensive.ml shows many of the examples from the tutorial implemented in OCaml.

I'm currently working on a proper manual.

The design

I tried to keep the patch minimal. In most cases existing code is not changed; only new functions or a single "match" branch are added, so the current code should not be affected. A test suite is provided as well. The changes, grouped by topic, are:

  • The description of inline assembly primitives and the parsing of the OCaml code (typing/inline_asm.ml, typedecl.ml).
  • The main handling of inline assembly, in selectgen.ml. It chooses the cheapest alternative, selects the argument source (register, memory or immediate), assigns registers and inserts register moves.
  • Boxing and unboxing (asmcomp/cmmgen.ml).
  • Support for 128- and 256-bit integer and float vector types (asmcomp/cmm.ml, cmmgen.ml, printcmm.ml, selectgen.ml).
  • A larger stack size for functions that use 128- or 256-bit vectors. Note: there is clearly a better solution, which is to increase the size of only the slots that contain vector values rather than all slots; for now I chose this approach to keep the patch simpler (asmcomp/reg.ml, amd64/emit.mlp).
  • Support for handling the vector registers (emit.mlp, x86_ast.mli, x86_gas.ml, x86_masm.ml).
  • Unit tests (testsuite/tests/inline-asm).

Platform dependent code

The architecture amd64 is well supported and tested. I created a framework and some support for the other architectures but it's not tested. Only small parts of the patch are architecture specific so finishing the support should be doable.

To write inline assembly for multiple platforms, a separate implementation must be provided for each platform. Since inline assembly primitives are indistinguishable from the standard OCaml to C externals, some platforms can have assembly calls, some C calls, and some can have OCaml code. This is the approach taken in many low level C libraries, such as glibc.

Code for different generations of CPUs can be specialized similarly, by having a superset for each new generation. For example, Float.SSE2 can implement floor using a C call, while Float.SSE4.1 can implement floor using a hardware primitive. This is very similar to what GCC does with switches like -msse4.1, and is actually more general because it allows the code for different generations to exist within the same compiled executable.
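
The "superset per generation" idea can be sketched with ordinary OCaml modules. This is only an illustration under assumptions: the names FLOAT_OPS, Float_sse2, Float_sse4_1 and select, as well as the has_sse4_1 flag, are hypothetical and not part of the patch, CPU detection is not shown, and standard library functions stand in for both the C call and the "%asm" external so that the sketch compiles without the patch.

```ocaml
(* Hypothetical sketch: module names and the signature are illustrative,
   not part of the patch. *)
module type FLOAT_OPS = sig
  val floor : float -> float
  val ceil : float -> float
end

(* Baseline generation: in the scheme described above this would go
   through a C call. The standard library functions stand in for it. *)
module Float_sse2 : FLOAT_OPS = struct
  let floor = Stdlib.floor
  let ceil = Stdlib.ceil
end

(* Newer generation: a superset of the baseline that overrides [floor].
   With the patch, the "%asm" external based on roundsd would go here;
   a plain OCaml stand-in keeps the sketch compilable without it. *)
module Float_sse4_1 : FLOAT_OPS = struct
  include Float_sse2
  let floor x = Stdlib.floor x
end

(* Both generations live in the same executable; one is picked at run time. *)
let select ~has_sse4_1 : (module FLOAT_OPS) =
  if has_sse4_1 then (module Float_sse4_1) else (module Float_sse2)

let () =
  let module F = (val select ~has_sse4_1:false) in
  Printf.printf "%.1f\n" (F.floor 2.7)
```

Because every generation satisfies the same signature, callers do not need to know which implementation they received.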

For bytecode, an inline assembly primitive needs to have a C call provided.

Examples and benchmarks

Benchmarks on Xeon E5-2687W

  • float_min.ml - fast float minimum using a hardware primitive, 4.8x speedup
  • float_round.ml - fast float floor and ceil using hardware primitives, 16.2x speedup
  • three_operand.ml - faster float addition and multiplication using AVX hardware primitives, 1.74x speedup
  • string_index.ml - fast String.index using hardware primitives, 1.07-9.42x speedup
  • fast_complex.ml - fast complex number operations using hardware primitives, 6x speedup
  • binomial_pricer.ml - fast mathematical operations using hardware vector operations, 3.7x speedup
  • packed_type.ml - packed data records using inline assembly fields. Among other uses, this is necessary for the interface with HDF5.

Vladimir Brankov and others added some commits Apr 2, 2014

bobot commented Apr 9, 2015

> @bobot We cannot jump to a function; one reason is that we will crash when the function tries to return. A function can only get called, in which case all the registers need to be parked.

The big difference is that OCaml does the call, not the programmer. For a C stub it will respect the ABI (callee-saved registers stay, no?); for an OCaml function it can choose to inline it, it can do tail-call optimisation, ... It just gives a lot more freedom to the compiler.

The code will be similar to

external gmp_add : Z.t -> Z.t -> Z.t = "ml_z_add"

let add x y =
    if Obj.is_int x && Obj.is_int y && add_test_nooverflow_and_put_result_in_rax x y
    then rax
    else gmp_add x y 

bobot commented Apr 9, 2015

> > Yes, we can teach the user that there is something to take into account, but you can do it everywhere else (unless I'm wrong and there are C externals in which floats are not boxed 😄), so it is not very natural.
>
> And just to correct a previous statement, there are C externals in which floats are not boxed.

Oh interesting, could you show me? I looked at the OCaml documentation and the standard library and, for example, here floats are boxed, no?

external exp : float -> float = "caml_exp_float" "exp" "float"

CAMLprim value caml_exp_float(value f)
{
  return caml_copy_double(exp(Double_val(f)));
}

alainfrisch commented Apr 9, 2015

> here float are boxed, no?

No, precisely: the (undocumented) "float" annotation tells the compiler that the native version of the primitive (which is exp not caml_exp_float) works on unboxed floats.
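
For comparison, and as far as I know, OCaml releases from 4.03 onward spell the same information with attributes instead of the extra "float" string. The sketch below binds the same pair of runtime symbols under a hypothetical name, my_exp, so that it does not shadow the standard exp.

```ocaml
(* Bytecode uses the boxed stub caml_exp_float; native code calls the C
   function exp directly on unboxed doubles.
   [@@unboxed] covers the argument and the result; [@@noalloc] states that
   the C function does not allocate or raise.
   The name [my_exp] is hypothetical. *)
external my_exp : float -> float = "caml_exp_float" "exp"
  [@@unboxed] [@@noalloc]

let () = Printf.printf "%f\n" (my_exp 1.0)
```

In both spellings it is an explicit annotation on the external, not the float type alone, that selects the unboxed calling convention.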

vbrankov commented Apr 9, 2015

@bobot Regarding goto, you wrote that %ladd would be replaced by the label that corresponds to the C function ml_z_add. However, you say that the call would not be done by the programmer, but by OCaml. I don't understand how both can be true.

bobot commented Apr 9, 2015

Ah, thank you @alainfrisch. So that fits my wish for asm: it is an annotation, and not just the float type, that tells that it can be unboxed 😀.

vbrankov commented Apr 9, 2015

@bobot For the float question, I just want to make sure I understand your request. Instead of determining whether an argument is unboxed based on its type, you advocate an explicit annotation. Is there anything else you advocate as a part of that discussion?

> And if you have some asm code that works with an argument of type 'a, it will not work if written with a float, which breaks a property of polymorphic functions.

And just to make sure, if a float is passed to an asm external which accepts an argument 'a, the value will not be unboxed and the assembly argument will be the pointer. Therefore float will be treated as any other type.
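
The representation that a polymorphic argument sees can be checked from plain OCaml with the standard Obj module. This is only an inspection sketch, not something the patch provides; the function name describe is hypothetical.

```ocaml
(* Classify the run-time representation of a value seen through 'a. *)
let describe (v : 'a) : string =
  let r = Obj.repr v in
  if Obj.is_int r then "immediate"
  else if Obj.tag r = Obj.double_tag then "boxed float"
  else "other block"

let () =
  print_endline (describe 42);   (* an int is an immediate value *)
  print_endline (describe 1.5);  (* a float is a pointer to a boxed double *)
  print_endline (describe "s")   (* a string is some other heap block *)
```

A float passed at type 'a is therefore just a pointer like any other block, which is what the assembly would receive as its argument.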

bobot commented Apr 9, 2015

I wrote "that correspond to the call to the C function"; I should have been more precise: there is a call. I am just trying to follow what asm goto does, but without explicit labels in OCaml. Asm goto allows you to jump to a local label; in OCaml we just have to execute an expression. We can choose a syntax like pattern matching (each case a label), or anonymous functions, or ...

I just want to push everyone to think outside the C box. Perhaps the best solution for OCaml is the GCC one, but before committing to one kind of inline asm I think we should look at other possibilities 🌸.

The syntax could borrow ideas from ctypes, where you define the arguments using OCaml. In order to simplify the documentation we can use OCaml syntax and typing.

let add x y =
Asm.(
   x86
   ~input:[input_value ~force_reg:`Sil "x" x;
           input_value ~force_reg:`Dil "y" y]
   "test    $1, %x
     jz      %ladd
     test    $1, %y
     jz      %ladd
     lea     -1(%x), %res
     add     %y, %res
     jo      %ladd"
     ~effect:[write_effect_registers [`Rax]]
     ~output_value:(output_value ~force_reg:`Rax "res" output_end)
     ~continuation:[
         `End, (fun res () -> res);
         `Label "ladd", (fun _ () -> gmp_add x y)
      ]
)

bobot commented Apr 9, 2015

For the float part, I advocate just what you said.

In my proposed syntax that would be done by using something like input_float "f" f instead of input_value. Since you have done an awesome job implementing all your ideas, I should perhaps try to do the same with my proposal, at least the input syntax. Would you be interested in that?

lpw25 commented Apr 9, 2015

> Instead of determining whether an argument is unboxed based on its type, you advocate an explicit annotation.

I also prefer explicit annotations over reading the type, since reading the type plays badly with abstraction. Although it is worth noting that external already uses -> in the type to determine arity, rather than an annotation.

> In order to simplify the documentation we can use ocaml syntax and typing.

It is not a good idea to make a compile-time operation look like a runtime one. The x86 in your example looks like a function, but it is not one. At least wrap this thing in [%asm ... ] to indicate it is not actual OCaml code.

vbrankov commented Apr 9, 2015

@bobot I think it's a great idea that you implement your input syntax. It might reveal non-obvious problems.

Here's one: ~continuation takes closures, or something that looks like closures, as arguments. If those are allocated like closures, it won't be efficient. Not allocating closures looks like a complicated problem; for example, what can the body of such blocks be?

An alternative may be to allow using exceptions from asm in some way. Exceptions are IMHO to some extent the closest thing in OCaml to goto. My recent experience with callbacks from asm to OCaml suggests that something like this might work.
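
One way to picture the exception-based alternative, as a minimal sketch in plain OCaml rather than anything the patch provides: the fast path raises an argument-less exception where the assembly would jump to its label, and the handler plays the role of the code at that label. The names Slow_path, fast_add and the slow parameter are hypothetical.

```ocaml
(* Hypothetical: stands in for the "ladd" exit of the assembly fragment. *)
exception Slow_path

(* Fast path: native addition, leaving through the exception on overflow. *)
let fast_add x y =
  let s = x + y in
  (* Signed overflow happened iff x and y share a sign that s does not. *)
  if (x >= 0) = (y >= 0) && (s >= 0) <> (x >= 0) then raise_notrace Slow_path;
  s

(* [slow] stands in for the C stub (gmp_add in the examples above). *)
let add ~slow x y =
  try fast_add x y with Slow_path -> slow x y
```

No closure is built for the handler here; whether an exception raised from inside an assembly fragment can be made to behave the same way is the open question.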

bobot commented Apr 9, 2015

@vbrankov okay I will do that.
@lpw25 what I like about using a usual module Asm with usual functions x86 and input_value (defined as primitives, e.g. `external input_value: ?force_reg:register -> string -> 'a -> asm_input = "%asm_input_value"`) is that Merlin or Odoc will be able to autocomplete or link the documentation naturally without any specific treatment. But it is just a convenience.

In ~continuation they are not allocated because they are like branches of a pattern match. The syntax can also use pattern matching instead of anonymous functions, like:

let add x y =
   match Asm.(x86
   ~input:[input_value ~force_reg:`Sil "x" x;
           input_value ~force_reg:`Dil "y" y]
   "test    $1, %x
     jz      %ladd
     test    $1, %y
     jz      %ladd
     lea     -1(%x), %res
     add     %y, %res
     jo      %ladd"
     ~effect:[write_effect_registers [`Rax]]
     ~output_value:(output_value ~force_reg:`Rax "res"))
  with
 | `End, res -> res
 | `Label "ladd", _ -> gmp_add x y

lpw25 commented Apr 9, 2015

(Edit: Rewritten after rereading previous post)

> @lpw25 what I like in using a usual module Asm with usual function x86, input_value (defined as primitive `external input_value: ?force_reg:register -> string -> 'a -> asm_input = "%asm_input_value"`) is that Merlin or Odoc will be able to autocomplete or link the documentations naturally without any specific treatment. But it is just convenient.

I really don't see the benefit of inventing a DSL which pretends to be real OCaml code. It will just confuse people when they try to do:

let add x y =
  let ia = input_value ~force_reg:`Sil "x" x in
  let ib = input_value ~force_reg:`Dil "y" y in
   match Asm.(x86
   ~input:[ia; ib]
   "test    $1, %x
     jz      %ladd
     test    $1, %y
     jz      %ladd
     lea     -1(%x), %res
     add     %y, %res
     jo      %ladd"
     ~effect:[write_effect_registers [`Rax]]
     ~output_value:(output_value ~force_reg:`Rax "res"))
  with
 | `End, res -> res
 | `Label "ladd", _ -> gmp_add x y

and get some weird error message.

vbrankov commented Apr 22, 2015

Output arguments should now be properly implemented. For example, this function adds the first argument to the second argument:

external add : int64 -> int64 ref -> unit = "%asm" "add_stub"
       "add     %0, %1  # add" "imr" "+r" "" "cc"

The generated code is efficient; boxing and eliminated references are handled. The given example produces a single assembly line per call.

let () =
  let x = ref 1L in
  add 6L x;
  add 6L x;

        movq    $1, %rax
        add     $6, %rax  # add
        add     $6, %rax  # add
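
As a reference for what the external computes, here is a plain OCaml function with the same type. It is a sketch of the intended semantics only, for example as a starting point for the fallback that bytecode needs, and it is not taken from the patch.

```ocaml
(* Plain OCaml reference for the "%asm" external above: add [x] into [r]. *)
let add (x : int64) (r : int64 ref) : unit = r := Int64.add !r x

let () =
  let x = ref 1L in
  add 6L x;
  add 6L x;
  Printf.printf "%Ld\n" !x
```

The first argument is the addend and the second is updated in place, which matches the "+r" read-write constraint on the second operand.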


vbrankov commented Apr 22, 2015

@bobot Regarding goto, I don't think that only one register would be killed in the example that you provided. When it encounters a branch, OCaml assumes that the set of killed registers is the union of the registers killed in each branch. If code can lead to a function call, even only conditionally, it is treated as if it destroys all the registers, and all of them are spilled immediately, not only in the branch which requires spilling. In the given example both x and y will be spilled even though that's not necessary for one branch:

let add x y =
  let z = if x < 0 then 0 else max x y in
  z + x + y

If this is true, then I see no benefits in supporting goto.

vbrankov commented Apr 23, 2015

@bobot This is an illustration that even C spills registers despite goto being used. If C had a strategy to spill the registers right before the call, that would mean that having calls in a loop would mean a lot of spilling and reloading.

#include <stdio.h>

/* Assumed to be defined elsewhere (not shown in the comment). */
int slow_add(int a, int b);

int add(int a, int b)
{
  if (a < 0) goto slow;
  if (b < 0) goto slow;
  return a + b;
slow:
  return slow_add(a, b);
}

int main()
{
  int a, b, c;

  scanf("%d %d\n", &a, &b);
  c = add(a, b);
  printf("%d\n", a + b + c);
  return 0;
}

Vladimir Brankov
Change the way the frame is recorded after calls from the assembly. Add the register BP to the list of registers.

bobot commented Apr 24, 2015

@vbrankov I forgot that in OCaml all the registers are caller-saved,
so I agree that if, at the assembly level, one branch of a conditional
does a function call, spilling must be done for all the values that are
needed after the branch. However, that is not the end of the story,
because optimisations (as I mentioned before: inlining, tail calls) can
remove this bad case. First I will look at your C example and
see what GCC does. Then I will look at your OCaml example and
show how OCaml's optimisations remove the bad case. I hope to show
with these arguments that it is worthwhile to let the compiler do the
call itself, because it can optimise it.

GCC

You said:

> This is an illustration that even C spills registers despite goto
> being used. If C had a strategy to spill the registers right before
> the call, that would mean that having calls in a loop would mean a
> lot of spilling and reloading.

int slow_add(int a, int b);

int add(int a, int b)
{
  if (a < 0) goto slow;
  if (b < 0) goto slow;
  return a + b;
slow:
  return slow_add(a, b);
}


int loop(int a, int b){

  while(a < 1000){
    if (a < 0) goto slow;
    if (b < 0) goto slow;
    a = a + b;
    continue;
  slow:
    a = slow_add(a, b);
  }

}

I don't understand your claim: if I compile that with just gcc -O1 -S -fverbose-asm test_alloc_c.c, GCC does not spill registers in the add function:

add:
.LFB0:
    .cfi_startproc
    subq    $8, %rsp    #,
    .cfi_def_cfa_offset 16
    movl    %edi, %eax  # a, tmp91
    shrl    $31, %eax   #, tmp91
    testb   %al, %al    # tmp91
    jne .L2 #,
    movl    %esi, %eax  # b, tmp94
    shrl    $31, %eax   #, tmp94
    testb   %al, %al    # tmp94
    jne .L2 #,
    leal    (%rdi,%rsi), %eax   #, D.1786
    jmp .L3 #
.L2:
    call    slow_add    #
.L3:
    addq    $8, %rsp    #,
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc

For the loop, GCC saves some registers before the loop, but only because they are
callee-saved registers; inside the loop only registers are used:

loop:
.LFB1:
    .cfi_startproc
    cmpl    $999, %edi  #, a
    jg  .L14    #,
    pushq   %rbp    #
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    pushq   %rbx    #
    .cfi_def_cfa_offset 24
    .cfi_offset 3, -24
    subq    $8, %rsp    #,
    .cfi_def_cfa_offset 32
    movl    %esi, %ebx  # b, b
    movl    %esi, %ebp  # b, tmp95
    shrl    $31, %ebp   #, tmp95
.L10:
    movl    %edi, %eax  # a, tmp90
    shrl    $31, %eax   #, tmp90
    testb   %al, %al    # tmp90
    jne .L7 #,
    testb   %bpl, %bpl  # tmp95
    jne .L7 #,
    addl    %ebx, %edi  # b, a
    jmp .L8 #
.L7:
    movl    %ebx, %esi  # b,
    call    slow_add    #
    movl    %eax, %edi  #, a
.L8:
    cmpl    $999, %edi  #, a
    jle .L10    #,
    addq    $8, %rsp    #,
    .cfi_def_cfa_offset 24
    popq    %rbx    #
    .cfi_restore 3
    .cfi_def_cfa_offset 16
    popq    %rbp    #
    .cfi_restore 6
    .cfi_def_cfa_offset 8
.L14:
    rep ret
    .cfi_endproc

If I take your OCaml example translated to C:

int max(int x, int y);

int add(int x, int y){

  int z;
  if (x < 1){
    z = 1;
  } else {
    z = max(x, y);
  };

  return z + x + y - 2;

}

Only callee-saved registers are saved at the start of the function;
everything else is done in registers.

add:
.LFB0:
    .cfi_startproc
    pushq   %rbp    #
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    pushq   %rbx    #
    .cfi_def_cfa_offset 24
    .cfi_offset 3, -24
    subq    $8, %rsp    #,
    .cfi_def_cfa_offset 32
    movl    %edi, %ebx  # x, x
    movl    %esi, %ebp  # y, y
    movl    $1, %eax    #, z
    testl   %edi, %edi  # x
    jle .L2 #,
    call    max #
.L2:
    leal    -2(%rax,%rbx), %eax #, D.1762
    leal    -2(%rbp,%rax), %eax #, D.1762
    addq    $8, %rsp    #,
    .cfi_def_cfa_offset 24
    popq    %rbx    #
    .cfi_def_cfa_offset 16
    popq    %rbp    #
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc

Other optimizations could reduce the penalty for the call even further.
For example, if -O2 is used and the condition is replaced by
__builtin_expect(!!(x < 1),1), then the compiler duplicates the end
of the function. It is as if the function were written like this:

int add(int x, int y){

  int z;
  if (__builtin_expect(!!(x < 1),1)){
    z = 1;
    return z + x + y - 2;
  } else {
    z = max(x, y);
    return z + x + y - 2;
  };

}

That proves that C compilers are able to avoid spilling when there is
a function call in one path. Or did I miss your point, did I miss
something in the asm?

OCaml

For this example:

let add x y =
  let z = if x < 0 then 0 else max x y in
  z + x + y

I agree that I was wrong when I said that OCaml could keep some data
in registers at the end of the condition if a call is done in the else
branch; I forgot that there are no callee-saved registers in OCaml.
However, OCaml can use inlining as usual in order to improve register
allocation. With the current version of the compiler:

  1. max x y is not inlined if max is defined in Pervasives, whatever
    the value of -inline.
  2. max x y is not inlined if max is defined locally and -inline
    is smaller than 2.
  3. max x y is inlined, but the >= of max remains polymorphic, if max is
    defined locally and -inline is greater than 2.
  4. max x y is inlined and is just a compare-jump if max is
    defined locally and specialized to int, whatever the value of -inline:

let max (x:int) y = if x >= y then x else y

I think we want one day to be able to use Pervasives.max on integers
and get just a comparison in assembly, as in case 4 (perhaps with the
final flambda?). In that case asm-goto is interesting because it allows the
compiler to inline the function call or to do other optimizations.
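
A small sketch of the difference between cases 1 and 4, for anyone who wants to try it. The names max_int_only, add_poly and add_spec are made up for this illustration. Whether either version ends up as a plain compare-and-jump depends on the compiler version and flags, so the generated assembly (ocamlopt -S) is the thing to check, not this note.

```ocaml
(* Case 4: a locally defined max, specialized to int by the annotation. *)
let max_int_only (x : int) y = if x >= y then x else y

(* Case 1: the polymorphic max from the standard library. *)
let add_poly x y =
  let z = if x < 0 then 0 else max x y in
  z + x + y

(* Same function, but calling the int-specialized max. *)
let add_spec x y =
  let z = if x < 0 then 0 else max_int_only x y in
  z + x + y
```

Both functions compute the same result on integers; only the code generated for the comparison is expected to differ.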

In conclusion

  • I agree that in the Zarith example, doing the call inside the inline
    assembly or in ocaml will change nothing in the final assembly.
    But with asm-goto, you write the call in ocaml and don't have to
    bother with all the complications yourself.
  • If you want to call an OCaml function or just evaluate different expressions,
    asm-goto lets the compiler do further optimizations.

So I think inline asm needs to handle multiple exit points.

Writing this answer took me more time than I thought; perhaps I should
have kept this time to continue coding a proposal with asm-goto. ;)

vbrankov Apr 24, 2015

@bobot In order to check whether an operation causes spilling, we should use some variables, do the operation and use the variables again. The C add function doesn't spill because it does nothing after the call. The loop doesn't spill b because it's in ebx, which is callee-saved. And generally, an example, even a correct one, cannot be a proof.

In your OCaml discussion, I used max exactly because I know it would produce a call and not get inlined.
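
A minimal sketch of the test shape described above (use some variables, make a call, use the variables again), written in OCaml. It assumes the [@inline never] attribute from later OCaml releases (4.03 and up) to keep the call from being inlined; slow_max is a hypothetical stand-in for max. Compiling with ocamlopt -S and reading the assembly for add should show whether x and y are saved around the call.

```ocaml
(* [@inline never] asks the compiler not to inline this function, so the
   else branch below contains a real call. *)
let[@inline never] slow_max (x : int) (y : int) = if x >= y then x else y

let add x y =
  let z = if x < 0 then 0 else slow_max x y in
  (* x and y are used again here, after the possible call, so they are
     live across it. *)
  z + x + y
```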

I disagree with the conclusion that having asm-goto would make things simpler. The only argument that I see is that the calls would be in OCaml. How much simpler is that? Can we have a side-by-side example? On the other hand, asm-goto looks like Pandora's box. For a start, OCaml syntax doesn't have the means to represent it, so we would need to come up with new language constructs. The compiler support would also be very complicated, since an inline assembly is an OCaml primitive. I don't see any free lunch in trying to treat asm-goto differently; it would come down to introducing a goto capability in OCaml.

gasche Apr 24, 2015

Member

From a more general viewpoint, my understanding of vbrankov's proposal is that its main advantage (compared to safer and also interesting approaches, such as letting users write lambda and cmm code directly to define some of their primitives, avoiding the cost of the C boundary without having to fork the compiler) is the support for exotic assembly instructions that the OCaml backend doesn't know about. If this is the problem that we aim to solve in this pull request, we should maybe focus on this use case by assuming small ASM snippets with little to no control flow.

That is not to diminish bobot's efforts to allow richer control flow to be expressed: it could be very interesting of course, but maybe we could converge on the simpler thing for this PR, discuss whether it is mergeable for this more specific purpose, and consider extensions that solve other problems only as a second step.

vbrankov Apr 24, 2015

@bobot Leo and I discussed possible solutions to your problem of avoiding spills, and some ideas came up.

  • A special pair of primitives which spill and reload all live registers. The code around this "block" would be treated as if it destroys no registers. In this example the effect would be that the registers are spilled only if the branch with the call is taken, but then the cost of the call would eclipse the cost of spilling:

let add x y =
  let z = if x < 0 then 0 else begin
    spill_live ();
    let r = max x y in
    restore_live ();
    r
  end in
  z + x + y

  • Write the C function with register variables to control precisely which registers are destroyed and hence reduce the set of destroyed variables on that call.

vbrankov Apr 24, 2015

Regarding alternative syntax for the calls, @lpw25 had a good suggestion to use attributes. Some possibilities:

(* GCC identifiers, less verbose *)
external floor : (float [@unbox][@param "x"]) -> (float [@unbox][@param "=x"])
  = "%asm" "round_stub" "roundsd $0, %0, %1"
external mov : (int [@param "m,r,r"]) -> (int ref [@param "=r,m,r"]) -> unit
  = "%asm" "mov_stub" "mov %0, %1"

(* more verbose *)
external floor :
     (float [@unbox][@xmm])
  -> (float [@unbox][@output][@xmm])
  [@asm "roundsd $0, %0, %1"] = "round_stub"
external mov :
     (int [@memory][@alt][@register][@alt][@register])
  -> (int ref [@output][@register][@alt][@memory][@alt][@register])
  -> unit
  [@asm "mov %0, %1"] = "mov_stub"

bobot May 4, 2015

Contributor

Inline asm with jump is implemented [1]. Perhaps some ideas can be reused:

  • It defines a new module Asm in the stdlib [2]; parsing and typing are done as usual, and third-party tools (merlin, odoc) should already understand it. But as @lpw25 predicted, if some sub-term is not a constant you get a strange (yet localized :) ) error: "Configuration parameters of inline assembly must be constant" and "Configuration function for inline assembly can't be used outside inline assembly application".
  • It defines a function Asm.arch that returns the current architecture (bytecode is considered an architecture) and allows matching on it (the match is naturally eliminated early) in order to have different code or assembly for each architecture.
  • You can jump to a specified OCaml expression [3].
  • Asm.ivalue and Asm.ovalue work with boxed values; Asm.ifloat and Asm.ofloat work with unboxed floats.
  • Many things are not implemented: only registers are used (no direct access to the stack), all the virtual registers used are considered interfering, there are no unboxed integers, some errors are not caught during parsing but only in a late compilation phase...
  • Thank you for looking at my problem of avoiding spills. In the spirit of your proposal with spill_live/restore_live I created #180, which reloads early in the branch that destroys registers. If this optimization is deemed too invasive it could be activated only on annotated branches ([@slow_branch]). But I prefer the compiler to do the right thing directly 😀.

Now that I understand the problem better, I will review this merge request in more depth.

PS: I can create a merge request to simplify reading and testing of [1].

vbrankov May 4, 2015

@bobot Thanks for the comprehensive example. Regarding multiple branches, I'm pleasantly surprised that avoiding the creation of closures doesn't look as difficult as I expected, although I haven't read the code. However, I am not convinced that all the problems are solved. This example boxes floats:

let x =
  let i = 1. in
  let open Asm in
  let y =
    match arch with
    | AMD64 ->
        amd64
          ~input:[ifloat "%0" i]
          "addsd        %0, %1  # func1"
          ~effect:[`VReg "%1"]
          ~output:(ofloat "%1" oend)
          ~label:[`End,(fun x () -> x)]
    | _ -> assert false
  in
  y > 1.

        movsd   .L105(%rip), %xmm0
        addsd   %xmm0, %xmm1    # func1
.L103:
        call    caml_alloc1@PLT
.L106:
        leaq    8(%r15), %rax
        movq    $1277, -8(%rax)
        movsd   %xmm1, (%rax)
.L102:
        movsd   .L105(%rip), %xmm0
        movsd   (%rax), %xmm1
        comisd  %xmm0, %xmm1

Not having branches allows the compiler to eliminate boxing. I feel it is difficult to implement this optimization with multiple branches.

external addsd : float -> float = "%asm" "" "addsd %0, %1" "x" "=x"

let x =
  let i = 1. in
  let y = addsd i in
  y > 1.

        movsd   .L103(%rip), %xmm0
        addsd %xmm0, %xmm1
        comisd  %xmm0, %xmm1

This is just an example, there may be more optimizations that may be difficult because the compiler needs to "see" and modify the code inside the branches.

bobot May 5, 2015

Contributor

The closure is not created simply because (fun x y -> t) t1 t2 is simplified early into let x = t1 in let y = t2 in t, and I apply each branch to as many variables as there are output registers, plus unit.
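
As a source-level illustration of that rewriting (a sketch only: the compiler performs it on its intermediate representation, and the function names and body here are made up), the two forms below are interchangeable when the arguments have no side effects:

```ocaml
(* Immediate application of a function literal: no closure needs to
   survive, since the function is applied on the spot. *)
let applied t1 t2 = (fun x y -> x * 10 + y) t1 t2

(* The let-bound form it is simplified into. *)
let let_bound t1 t2 =
  let x = t1 in
  let y = t2 in
  x * 10 + y
```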

My motto is: if Cifthenelse can do it, Casminline should be able to do it 😜. And for Cifthenelse there is some float boxing in g and not in f (is there no patch for improving the g case?):

let g x b =
  let i = 1.0 in
  let y = if b then x +. 1. else x +. 2. in
  y > 1.

let f x b =
  let i = 1.0 in
  (if b then x +. 1. else x +. 2.) > 1.

So by modifying Cmmgen.unbox_float, a version of f with Casminline should be unboxed. Your example would not be, but once that case is optimized for if, it should be possible to optimize it for asminline too.

bobot May 5, 2015

Contributor

It is indeed your patch 😄 (#6260), but it handles let, and it is already applied. If the Uifthenelse case is also handled, g doesn't allocate anymore:

diff --git a/asmcomp/cmmgen.ml b/asmcomp/cmmgen.ml
index e3c723a..9a9f30f 100644
--- a/asmcomp/cmmgen.ml
+++ b/asmcomp/cmmgen.ml
@@ -1260,6 +1260,10 @@ let rec is_unboxed_number = function
         | _ -> No_unboxing
       end
   | Ulet (_, _, e) | Usequence (_, e) -> is_unboxed_number e
+  | Uifthenelse(_, e2, e3) ->
+      let is_e2 = is_unboxed_number e2 in
+      let is_e3 = is_unboxed_number e3 in
+      if is_e3 = is_e2 then is_e2 else No_unboxing
   | _ -> No_unboxing

 let subst_boxed_number unbox_fn boxed_id unboxed_id box_chunk box_offset exp =

Instead of writing one patch per case, someone should perhaps try to make is_unboxed_number as complete as possible. I think I should be able to extend is_unboxed_number to Uasminline in a satisfactory way (it just needs an environment in is_unboxed_number for the variables that are unboxed).
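
To make the idea of the patch easier to follow outside the compiler, here is a toy model of the analysis. It is only a sketch: the expr and unboxing types are simplified stand-ins invented for this example, not the real Clambda or Cmmgen definitions. A conditional is treated as producing a boxed float only when both branches do.

```ocaml
(* Simplified stand-ins for the compiler's types (hypothetical). *)
type unboxing = No_unboxing | Boxed_float

type expr =
  | Const_int of int
  | Box_float of float            (* an expression that allocates a boxed float *)
  | Let of string * expr * expr
  | Sequence of expr * expr
  | If of expr * expr * expr

let rec is_unboxed_number = function
  | Box_float _ -> Boxed_float
  (* The result of a let or a sequence is the result of its last part. *)
  | Let (_, _, e) | Sequence (_, e) -> is_unboxed_number e
  (* The added case: both branches must agree, otherwise give up. *)
  | If (_, e2, e3) ->
      let is_e2 = is_unboxed_number e2 in
      let is_e3 = is_unboxed_number e3 in
      if is_e2 = is_e3 then is_e2 else No_unboxing
  | Const_int _ -> No_unboxing
```

With this model, a conditional whose branches are both boxed floats (the shape of g above) is reported as Boxed_float, while a conditional mixing a float and an integer is reported as No_unboxing.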

vbrankov May 5, 2015

@bobot Pull request #107 should do more unboxing.

vbrankov May 5, 2015

@bobot All right, I now feel multiple branches is doable. I will examine it in detail and get back.

vbrankov Jul 30, 2015

I've heard that there was a discussion about this patch at the last OCaml Dev meeting and that the conclusion was generally negative. Whoever knows about that meeting, please let me know if there's any followup or questions, or if anything can be salvaged out of this patch. For example, easily adding native primitives might still be a useful thing to have.

xavierleroy Jul 30, 2015

Contributor

At the dev meeting, the following points were raised (in no particular order).

  • Complexity. Even with your best efforts, this is a fairly big and complex extension. (I know exactly what you went through here because at about the same time I was adding extended inline asm to the CompCert C compiler...).
  • Further adding to the complexity is the need for a mechanism (preprocessor or otherwise) to select the appropriate asm fragment for the target, or a fallback implementation.
  • Aesthetics. Some developers just don't want to see ugly asm templates with obscure % holes in their nifty Caml source files, and feel that the proper place for such code is in separate .s files, or in .c files with inline assembly.
  • Performance gains wrt "noalloc" C/asm external functions. Calling "alloc" external functions is expensive indeed. However, with the new unboxing annotations (which all devs at the meeting liked, by the way), many more external functions can be declared "noalloc", including those for which you would typically feel the need for inline asm. The overhead of calling a "noalloc" external function is relatively low. So, a plausible alternative to inline asm is just "noalloc" external functions implemented either in C with inline asm, or in assembly.
  • As a performance data point, I mentioned a recent experiment with the Zarith library where some "fast paths" are rewritten in portable OCaml, using Hacker's Delight tricks for overflow detection and what not, then inlined by ocamlopt, leaving the "alloc" external calls to the slow path. The performance obtained is not quite what you'd get with inlining "branch on overflow" asm instructions, but not that far either. Sometimes, the best way to use asm is to recognize that you don't need it :-)
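
To give an idea of the kind of portable fast path mentioned in the last point, here is a minimal sketch of overflow detection for native int addition, using the sign-bit test described in Hacker's Delight. This is not Zarith's actual code: add_with_fallback and the ~slow callback are hypothetical names, and ~slow stands in for the "alloc" external call taken on the slow path.

```ocaml
(* Signed addition overflows exactly when both operands have a sign
   different from the sign of the wrapped sum, i.e. when
   (a lxor s) land (b lxor s) has its sign bit set. *)
let[@inline] add_with_fallback ~slow a b =
  let s = a + b in
  if (a lxor s) land (b lxor s) < 0
  then slow a b          (* overflow: hand over to the slow path *)
  else s                 (* fast path: the native sum is exact *)
```

For example, add_with_fallback ~slow:(fun _ _ -> failwith "overflow") 1 2 takes the fast path, while passing max_int and 1 reaches the slow path.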

The conclusion of our discussions is that we are not going to integrate this PR.

The question you raise about the difficulty of adding new, inlined primitives is a good one and remains open. My personal take on it is that perhaps we should think twice before adding such primitives and try "noalloc" external functions first.
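
For reference, a sketch of what such a declaration can look like. It assumes the attribute syntax that shipped later, in OCaml 4.03 ([@@unboxed] and [@@noalloc]); at the time of this discussion the annotations were still being designed. The OCaml-side name floor_unboxed is made up; caml_floor_float is the runtime's bytecode stub and floor is the C library function called directly, on unboxed floats, in native code.

```ocaml
(* Bytecode uses the boxed stub "caml_floor_float"; native code calls the
   C function "floor" with the float passed and returned unboxed.
   [@@noalloc] tells the compiler that the C function neither allocates on
   the OCaml heap nor raises, so the call needs no runtime bookkeeping. *)
external floor_unboxed : float -> float
  = "caml_floor_float" "floor" [@@unboxed] [@@noalloc]
```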

DemiMarie Oct 7, 2015

Contributor

One approach would be to allow inline primitives to be written as compiler plug-ins, written in OCaml, that use an embedded DSL (possibly Camlp4/ppx based?) to describe the assembly code and/or bytecode that needs to be generated.

Note that external functions – even if "noalloc" – are still far too heavyweight for primitives that are single machine instructions.

xavierleroy Oct 25, 2015

Contributor

I'm closing this pull request so that we can better focus on other requests.

@hannesm hannesm referenced this pull request Mar 11, 2017

Closed

performance #1

lpw25 pushed a commit to lpw25/ocaml that referenced this pull request Feb 21, 2018
