Use a lookup table for matches with constant results #863

Merged
mshinwell merged 1 commit into ocaml:trunk on Dec 9, 2016

Conversation

7 participants
stedolan (Contributor) commented Oct 18, 2016

match expressions are always compiled to a control structure, using branches or jump tables to execute the correct case. When every case returns a constant, it is more efficient to use a data structure instead and load the result from a lookup table. Currently, OCaml compiles this:

type t = A | B | C
let to_char = function A -> 'A' | B -> 'B' | C -> 'C'

to this Lambda:

  (to_char/1203 =
     (function param/1208
       (switch* param/1208 case int 0: 'A'
                           case int 1: 'B'
                           case int 2: 'C')))

which results in this x86-64 assembly, performing several comparisons to find the result:

camlSwitches__to_char_4:
        .cfi_startproc
.L103:
        sarq    $1, %rax
        cmpq    $1, %rax
        je      .L101
        jg      .L100
.L102:
        movq    $131, %rax
        ret
        .align  4
.L101:
        movq    $133, %rax
        ret
        .align  4
.L100:
        movq    $135, %rax
        ret

With this patch, the Lambda produced is:

  (to_char/1203 =
     (function param/1208
       (array.unsafe_get[addr] [0: 'A' 'B' 'C'] param/1208)))

which compiles to this assembly, loading the result from an array of constants:

camlSwitches__to_char_4:
        .cfi_startproc
.L100:
        movq    camlSwitches__Pmakeblock_20@GOTPCREL(%rip), %rbx
        movq    -4(%rbx,%rax,4), %rax
        ret
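
For intuition, here is a rough source-level picture of the transformation. This is a hypothetical sketch (the real change happens inside the compiler, not in user code), using Obj.magic only to expose the runtime representation of the constructors:

type t = A | B | C

(* The constant results, stored once in a table. *)
let table = [| 'A'; 'B'; 'C' |]

let to_char (x : t) : char =
  (* Constant constructors of t are represented at runtime as the integers
     0, 1 and 2, so the whole match collapses to a single indexed load. *)
  let tag : int = Obj.magic x in
  table.(tag)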
gasche (Member) commented Oct 18, 2016

Two questions pop to mind:

  1. what happens when the constants returned from the match are of different types?
  2. when the match is inlined, does the new form prevent optimizations that previously worked?
chambart (Contributor) commented Oct 19, 2016

Currently, constant propagation won't go through that: Parrayref is never used on anything constant, so it is not optimized.
This is somewhat safe as the get primitive is the 'addr' one.

I also don't know what the minimal size is for this to be beneficial. And sharing is lost (I don't know whether this would really be a problem), as in:

type t = A1 | A2 | ... | A100
let f = function
  | A1 | ... | A50 -> 1
  | A51 | ... | A100 -> 2 

By the way, this is the kind of optimisation I'd rather do after inlining (probably in cmmgen), to benefit from it even if the constants are computed.
This is an instance of an optimization I want to do: compile switch and if more efficiently when no branch contains any computation (each branch is only a variable; this is the select primitive of LLVM), which can get rid of some branches.
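
As a hedged illustration of the branch shapes this targets (the function name is made up): when both arms of a conditional are plain variables, a select-style lowering could replace the jump with a conditional move.

(* Hypothetical example: both branches are just variables, so a backend
   implementing the optimization described above could emit a branchless
   select / conditional move instead of a conditional jump. *)
let pick (flag : bool) (a : int) (b : int) : int =
  if flag then a else b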

stedolan (Contributor) commented Oct 19, 2016

@gasche:

  1. This happens in the untyped IR, so whether the constants returned are of the same type is irrelevant.
  2. As @chambart points out, this could inhibit optimisations. I agree that it's being done too early: it's choosing a more opaque representation of matches before Flambda has had time to play with them. I changed it to operate at the Cmm level instead, after any control-flow optimisations have happened.

@chambart:

Now that it's at Cmm level, sharing isn't lost. The generated Cmm array will reference the same complex constant if it's used in multiple cases.
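
To make the point about sharing concrete, here is a hypothetical example (not from the PR): two cases return the same structured constant, and at the Cmm level both slots of the generated table can reference the single statically allocated symbol for it.

type t = One | Two | Three

(* "many" is a boxed constant used by two cases; the Cmm lookup table can
   point both corresponding slots at the same symbol, so nothing is duplicated. *)
let describe = function
  | One -> "one"
  | Two | Three -> "many"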

| Cconst_natint _
| Cconst_float _
| Cconst_symbol _ -> true
| _ -> false in

gasche (Member) commented Oct 19, 2016 on the excerpt above:
I think this definition would make sense in the global scope (although the pointer bit might be too specific).
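
A hedged sketch of what hoisting this predicate might look like, based only on the constructors visible in the excerpt above (the actual helper in the patch may cover more cases, for instance Cconst_int, and may be named differently):

(* Recognize Cmm expressions that are already constants and can therefore be
   stored directly in a static lookup table.  Requires compiler-libs. *)
let is_constant (e : Cmm.expression) : bool =
  match e with
  | Cmm.Cconst_natint _
  | Cmm.Cconst_float _
  | Cmm.Cconst_symbol _ -> true
  | _ -> false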

gasche added the approved label Oct 19, 2016

gasche (Member) commented Oct 19, 2016

I like the new implementation.

mshinwell (Contributor) commented Oct 20, 2016

Shouldn't we get some benchmarks for something like this before merging?

garrigue (Contributor) commented Oct 20, 2016

Sure, this kind of thing requires extensive benchmarks, to have a good understanding of what is improved. One easily gets surprises.

mshinwell (Contributor) commented Oct 20, 2016

During benchmarking perhaps we can make some judgement about a possible cut-off threshold for this as well. We should try on more than one architecture.
Removing the approved tag until this has been done, given that @garrigue agrees.

mshinwell removed the approved label Oct 20, 2016

xavierleroy (Contributor) commented Dec 4, 2016

My compiler writer's guts tell me that this optimization should be beneficial most of the time, and detrimental almost never.

mshinwell (Contributor) commented Dec 9, 2016

I have read the patch and it seems correct.
If it causes trouble, we only have Xavier's guts left to blame.

mshinwell added the approved label Dec 9, 2016

mshinwell merged commit 2ee5242 into ocaml:trunk on Dec 9, 2016

2 checks passed

continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
gasche (Member) commented Dec 22, 2016

In MPR#7439, user markghayden made exactly this feature request. He gave three examples of switches with constant results (f, g and h in the Cmm below); the first two are optimized as expected, but the last one is not, unless flambda is used.

type t = A | B | C | D ;;

let h v =
  match v with
  | A -> f
  | B -> (fun _ -> 1)
  | C -> (fun _ -> 2)
  | D -> (fun _ -> 3)
;;

Looking at the Cmm code generated by trunk, I see the following:

(function{test.ml:4,8-60} camlTest__f_1204 (param/1206: val)
 (load val (+a (+a "camlTest__12" (<< (or param/1206 1) 2)) -4)))

(function{test.ml:11,8-68} camlTest__g_1207 (param/1209: val)
 (load val (+a (+a "camlTest__11" (<< (or param/1209 1) 2)) -4)))

(function{test.ml:18,6-101} camlTest__h_1210 (v/1211: val)
 (switch (>>s v/1211 1) 
 case 0: (load val "camlTest")
 case 1: "camlTest__8"
 case 2: "camlTest__9"
 case 3: "camlTest__10"))

when flambda is enabled:

(function{test.ml:4,8-60} camlTest__f_10 (param_12/1036: val)
 (load val (+a (+a "camlTest__3" (<< (or param_12/1036 1) 2)) -4)))

(function{test.ml:11,8-68} camlTest__g_20 (param_22/1046: val)
 (load val (+a (+a "camlTest__2" (<< (or param_22/1046 1) 2)) -4)))

(function{test.ml:18,6-101} camlTest__h_34 (v_37/1041: val)
 (load val (+a (+a "camlTest__1" (<< (or v_37/1041 1) 2)) -4)))

I don't think this is something to worry about. The reason it is not optimized, if I understand correctly, is the use of f in the first case: it gets translated to a lookup in the global module (load val "camlTest"), which is thus not considered "constant" by the current implementation.
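
As a hypothetical check (not part of the report): if the first arm returned a literal closure like the others, all four results would be static constants and the switch should collapse to a table load even without flambda, much as f and g do above.

(* reusing type t = A | B | C | D from the example above *)
let h' v =
  match v with
  | A -> (fun _ -> 0)
  | B -> (fun _ -> 1)
  | C -> (fun _ -> 2)
  | D -> (fun _ -> 3)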

diml (Member) commented Dec 23, 2016

I'm interested in this feature as well. With it I could remove this Obj.magic and still get the speed of a simple array lookup:

let feed (type u) (type s) (state : (u, s) state) char (stack : s) : s =
  let idx = (automaton_state state lsl 8) lor (Char.to_int char) in
  (* We need an Obj.magic as the type of the array can't be generalized.
     This problem will go away when we get immutable arrays. *)
  let magic :
    ((u', s') state -> char -> s' -> s') array
    -> ((u , s ) state -> char -> s  -> s ) array =
    Caml.Obj.magic
  in
  (magic transitions).(idx) state char stack
[@@inline always]

nojb added a commit to nojb/ocaml that referenced this pull request Dec 29, 2016

gasche (Member) commented Feb 26, 2017

This optimization is incorrect on Cconst_int values coming from polymorphic variants. I found the issue while investigating the Camomile regression in 4.05 (yoriyuki/Camomile#21); here is a minimal repro case:

let test = function
  | 1 -> `Primary
  | 2 -> `Secondary
  | 3 -> `Tertiary
  | n -> invalid_arg "test"

let to_string = function
  | `Primary -> "`Primary"
  | `Secondary -> "`Secondary"
  | `Tertiary -> "`Tertiary"

let () =
  let print n =
    Printf.printf "%d: %s\n" n (to_string (test n)) in
  List.iter print [1;2;3]

Under 4.04 (or 4.05 with the present PR reverted), running this program prints:

1: `Primary
2: `Secondary
3: `Tertiary

Under 4.05, it prints:

1: `Secondary
2: `Secondary
3: `Secondary
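
For background, a hedged illustration (not a diagnosis of the exact miscompilation): constant polymorphic variants are represented at runtime as integer hashes of the constructor name rather than small consecutive tags, which is why they reach the backend as Cconst_int values. The snippet below uses Obj only to print those representations.

let () =
  let repr v = (Obj.magic v : int) in
  Printf.printf "%d %d %d\n" (repr `Primary) (repr `Secondary) (repr `Tertiary)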
mshinwell (Contributor) commented Feb 27, 2017

I will look at this.
