
Simplify matches that are an affine function of the input #8547

Merged
merged 11 commits into from Apr 9, 2019

Conversation

@smuenzel-js (Contributor) commented Mar 26, 2019

This PR converts matches that are affine functions of the input index from a table lookup to a direct computation of the output value.

This is similar to #632, but is vastly simpler (and covers more cases!), because of the changes made in #863, #1068, #1069 to convert matches to table-lookups in certain circumstances.

Isn't the table-lookup optimization already good enough?

This optimization is more cache-efficient, because we don't need to store a table at all.

It can replace uses of Obj.magic, see previous comments on #632 by @damiendoligez:

The array trick suggested by @xavierleroy is more general but gains less in the special case of the identity function.

I have used Obj.magic in the past for this, and I have seen others use it, so I think it's a valuable optimization. I would still like to see @xavierleroy 's idea implemented but I think we should also have this one.

Affine function vs just optimizing identity function

This optimization triggers for all affine functions, because the additional changes compared to just triggering when the match is the identity are small.
This also means that the optimization is a little bit more robust to changes in the output.
Figuring out more complicated functions of the input seems like it's not worth it -- it will be very tricky to get this right, and it's not really clear what kind of functions we'd want to support.
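For concreteness, here is a small hypothetical example (not from the PR) of the kind of match this optimization targets: the result is an affine function of the constructor index, with offset 5 and slope 2, so no lookup table should be needed.

```ocaml
type t = A | B | C | D

(* Constructors A..D are represented by indices 0..3, and the result
   is 2 * index + 5 -- an affine function of the input, so the match
   can compile to a direct computation instead of a table lookup. *)
let f = function
  | A -> 5
  | B -> 7
  | C -> 9
  | D -> 11
```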

Benchmarks

We attempt to simulate a large program that consists of a bunch of matches and occupies more memory than the cache.
The following code generates a program with many match cases that are called in order, which is an attempt to create cache misses in the table case.

let printf = Printf.printf

let num_matches = 10_000

let () =
  for i = 0 to num_matches do
    printf "let [@inline never] do_match_%i x =\n" i;
    printf "  let x = Sys.opaque_identity x in\n";
    printf "  match x with\n";
    for j = 0 to 200 do
      printf "  | %i -> %i\n" j (j + i)
    done;
    printf "  | _ -> 0\n";
    printf "\n"
  done

let () =
  printf "let () =\n";
  printf "  let result = ref 0 in\n";
  printf "  for i = 0 to 40_000 do\n";
  printf "    let m = i mod %i in\n" (num_matches + 1);
  for i = 0 to num_matches do
    let sign = 1 + ((i mod 2) * (-2)) in
    printf "    result := !result + %i * (do_match_%i m);\n" sign i;
  done;
  printf "  done;\n";
  printf "  print_int !result"

Without the optimization:

4100400/tmp/old.opt  3.60s user 0.01s system 99% cpu 3.606 total
4100400/tmp/old.opt  3.53s user 0.00s system 99% cpu 3.532 total
4100400/tmp/old.opt  3.56s user 0.00s system 99% cpu 3.564 total
4100400/tmp/old.opt  3.53s user 0.00s system 99% cpu 3.531 total

With the optimization:

4100400/tmp/new.opt  3.43s user 0.00s system 99% cpu 3.436 total
4100400/tmp/new.opt  3.45s user 0.00s system 99% cpu 3.453 total
4100400/tmp/new.opt  3.43s user 0.00s system 99% cpu 3.435 total

So we have around a 2% improvement.
The binary size of this specially crafted program is also interesting:

ls -alh /tmp/*.opt
-rwxr-xr-x 1 user group 4.3M Mar 25 23:11 /tmp/new.opt
-rwxr-xr-x 1 user group 21M Mar 25 23:11 /tmp/old.opt

Finally, we can examine cache behaviour:

perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations  -- /tmp/old.opt
 Performance counter stats for '/tmp/old.opt':

       498,978,317      cache-references                                            
           375,006      cache-misses              #    0.075 % of all cache refs    
    12,360,079,541      cycles                                                      
     4,436,940,626      instructions              #    0.36  insn per cycle         
     1,203,697,174      branches                                                    
             4,599      faults                                                      
                 1      migrations                                                  

       3.531722742 seconds time elapsed
perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations  -- /tmp/new.opt
4100400
 Performance counter stats for '/tmp/new.opt':

       389,966,494      cache-references                                            
            56,619      cache-misses              #    0.015 % of all cache refs    
    12,020,269,063      cycles                                                      
     4,421,640,636      instructions              #    0.37  insn per cycle         
     1,202,342,002      branches                                                    
               585      faults                                                      
                 2      migrations                                                  

       3.434637083 seconds time elapsed
const_actions.(act)) cases)));
addr_array_ref (Cconst_symbol (table, dbg)) (tag_int arg dbg) dbg
| Some (offset, slope) ->
let const_natint x dbg =

@smuenzel-js (Author, Contributor) commented Mar 26, 2019
I'm not really sure about this part, or about the distinction between Cconst_natint and Cconst_int.
I added this section instead of just using nativeint because add_int and mul_int have special cases that optimize uses of Cconst_int.

@smuenzel-js smuenzel-js force-pushed the smuenzel-js:identity-matches-late-simplification branch from 131a08a to 8473857 Mar 26, 2019
@lthls (Contributor) commented Mar 26, 2019

It seems that when your optimisation triggers, you're ignoring the cases array, which maps arguments to actions. As a result, you can get errors when either the cases are not sorted or there is sharing between the cases.

An example which I think would fail:

type t = A | B | C | D

let f = function
| A | D -> 0
| B -> 1
| C -> 2

This should be fixable by iterating over cases instead of const_actions.

@smuenzel-js (Author, Contributor) commented Mar 27, 2019

Ahh yes! I pushed a fix for this. I wasn't able to reproduce with your test case, but I added a GADT case that triggers this bug to the tests.

@gasche (Member) commented Mar 27, 2019

Two remarks:

  • It is easy to introduce correctness bugs in these optimizations,
    which quickly offset the performance gains obtained in the good
    cases. We should be super-careful when reviewing the code.

  • The patch makes the make_switch function longer than it was
    before, which hurts readability. Could we try to split it into
    independent auxiliary functions as much as possible, with readable
    names and a nice flow of code?

In an ideal world, the complete flow of the optimization and
code-generation decisions could be understood by the following code,
supported by the right auxiliary function definitions (which could
still be local to the make_switch function; IIRC Stephen had an
explanation of why extract_uconstant makes local assumptions.)

match Misc.Stdlib.Array.map_opt extract_uconstant actions with
| None ->
  Cswitch (arg, cases, actions, dbg)
| Some const_actions ->
  match extract_affine cases const_actions with
  | None -> 
    let table = make_constant_table const_actions dbg in
    make_table_lookup table arg dbg
  | Some (offset, slope) ->
    make_affine_computation ~offset ~slope arg dbg

P.S.: functions of the form a * x + b are usually called "affine"
rather than "linear" -- they are linear when b = 0.
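To make the sketch above concrete, extract_affine could look roughly like this (a hypothetical sketch following the flow in the pseudocode, not the merged code; it assumes cases.(i) gives the action index for argument i and const_actions gives the constant value of each action, using plain ints instead of Cmm constants):

```ocaml
(* Return Some (offset, slope) if const_actions.(cases.(i)) equals
   offset + slope * i for every argument i, and None otherwise.
   Going through the cases array (rather than const_actions directly)
   handles unsorted and shared cases, as pointed out above. *)
let extract_affine (cases : int array) (const_actions : int array)
  : (int * int) option =
  let n = Array.length cases in
  if n < 2 then None
  else
    let v i = const_actions.(cases.(i)) in
    let offset = v 0 in
    let slope = v 1 - offset in
    let rec check i =
      if i >= n then Some (offset, slope)
      else if v i = offset + slope * i then check (i + 1)
      else None
    in
    check 0
```

Note how the second test below mirrors @lthls's sharing example (A | D -> 0, B -> 1, C -> 2): the shared action breaks the affine pattern, so the sketch correctly returns None.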

if length >= 2
then begin
match const_actions.(cases.(0)), const_actions.(cases.(1)) with
| Cint v0, Cint v1 ->

@alainfrisch (Contributor) commented Mar 27, 2019

Using a reference just to compute a Boolean is not so nice.

let slope = Nativeint.sub v1 v0 in
let check i idx = function
  | Cint v -> v = Nativeint.(add v0 (mul (of_int i) slope))
  | _ -> false
in
if Misc.Stdlib.Array.forall_i (fun i idx -> check i const_actions.(idx)) cases then Some (v0, slope)
else None

(and add the definition of Misc.Stdlib.Array.forall_i)
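A possible definition of that helper, kept here as a sketch (the PR may have named or placed it differently):

```ocaml
(* forall_i p a = true iff p i a.(i) holds for every index i.
   Vacuously true on the empty array. *)
let forall_i (p : int -> 'a -> bool) (a : 'a array) : bool =
  let n = Array.length a in
  let rec loop i = i >= n || (p i a.(i) && loop (i + 1)) in
  loop 0
```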

@smuenzel-js (Author, Contributor) commented Mar 28, 2019

Done, and I added the definition to Stdlib, not Misc.Stdlib, because it seems generally useful.
I have no issue moving it to Misc if that's not a good idea.

@gasche (Member) commented Mar 28, 2019

I would recommend moving to Misc and then opening a separate PR about the move to the stdlib. Any stdlib addition is going to need a lot of scrutiny and public discussion, you don't want to have this as a dependency of your PR.

@alainfrisch (Contributor) commented Mar 27, 2019

2% speedup on a specially crafted micro-benchmark is not impressive. I don't think this is worth the trouble, unless you have a very specific use case in mind.

@mshinwell (Contributor) commented Mar 27, 2019

I think it is a mistake to assess optimisations such as these solely based on a percentage improvement. Individual percentage improvements can be small, but once one has sufficiently many small optimisations, a more notable benefit should be gained that comes above noise in the measurements.

That said, in this particular case, the stats from perf say more about the worthiness of the optimisation than the actual time improvement, IMO.

@Drup (Contributor) commented Mar 27, 2019

@alainfrisch Note also the fivefold size reduction (on a carefully crafted benchmark, of course), which sounds quite interesting to me.

@alainfrisch (Contributor) commented Mar 27, 2019

@mshinwell I agree on principle with what you said, but the proposed optimization is super specific. Is there any indication that it will trigger in the wild, in code critical enough to justify the ad hoc optimization?

@smuenzel-js (Author, Contributor) commented Mar 28, 2019

@mshinwell I agree on principle with what you said, but the proposed optimization is super specific. Is there any indication that it will trigger in the wild, in code critical enough to justify the ad hoc optimization?

I made an instrumented version of the compiler, and in the compiler distribution itself (without tests), we see:
15 Affine
1437 Normal
148 Table

On a subset of Jane Street's code base, we see:
4323 Affine
24575 Normal
12037 Table

I examined some of the matches where it triggers, and it is indeed code that's critical enough for it to matter to us.

Note that for some critical code, the optimization doesn't trigger -- because we've already replaced it by Obj.magic (or its equivalent). This optimization allows us to stop doing that, getting the benefit of the type checker while not losing performance.
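As an illustration (a hypothetical example, not code from the PR), the unsafe pattern in question reads the constructor's runtime representation directly, while the typed match can now compile to the same direct computation:

```ocaml
type t = A | B | C

(* The unsafe trick: a nullary constructor is represented as its
   index, so Obj.magic yields that index directly -- but it bypasses
   the type checker entirely (shown only for illustration). *)
let to_int_unsafe (x : t) : int = Obj.magic x

(* The typed equivalent: an identity-like match, which the affine
   optimization can turn into a direct computation as well. *)
let to_int (x : t) : int =
  match x with
  | A -> 0
  | B -> 1
  | C -> 2
```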

@gasche (Member) commented Mar 28, 2019

Thanks for restructuring the code!

(I haven't done a full review of the code-generation parts yet, but personally I feel good about this PR; @alainfrisch is right to ask about concrete evidence of usefulness (do not hesitate to provide these from the start in your next optimization PR), but the applicability numbers you give are enticing.)

smuenzel-js added 2 commits Mar 28, 2019
(const_natint offset dbg)
dbg
in
fun arg cases actions dbg ->

@alainfrisch (Contributor) commented Mar 28, 2019

It degrades readability to have the arguments of the toplevel make_switch function moved so far away from its name. It would be better to put back the arguments next to make_switch and either keep the helper functions local or make them global.

@smuenzel-js (Author, Contributor) commented Mar 28, 2019

Fixed with option (1)

| _ -> None
in
let array_option_get array =
try Some (Array.map Option.get array) with | _ -> None

@alainfrisch (Contributor) commented Mar 28, 2019

try ... with _ -> ... should be avoided at all costs (it will happily capture OOM and stack overflow...). This might deserve a new helper function in Misc.Stdlib (Array.all_somes(?) : 'a option array -> 'a array option).

@smuenzel-js (Author, Contributor) commented Mar 28, 2019

I moved the function, and I'm using a local exception (is that too fancy a language feature for compiler code?)
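Such a helper with a local exception could look like this (a sketch using the name suggested in the review; the merged code may differ):

```ocaml
(* all_somes : 'a option array -> 'a array option
   Unwraps every Some, or returns None if any entry is None.
   The local exception short-circuits on the first None without
   the dangers of a catch-all try ... with. *)
let all_somes (a : 'a option array) : 'a array option =
  let exception Has_none in
  match Array.map (function Some x -> x | None -> raise Has_none) a with
  | unwrapped -> Some unwrapped
  | exception Has_none -> None
```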

@smuenzel-js smuenzel-js changed the title Simplify matches that are a linear function of the input Simplify matches that are an affine function of the input Mar 28, 2019
@gasche (Member) approved these changes Mar 28, 2019 and left a comment:

I reviewed the code and believe it is correct.

However, I would like to make sure that @alainfrisch also approves before this is merged. (As mentioned, it's a sensitive piece of code, so it deserves two approvals.)

else false in
loop 0

let all_somes a =

@gasche (Member) commented Mar 28, 2019

I would prefer map_opt for this function, as an instance of the Haskell function

mapM : Monad m => (a -> m b) -> t a -> m (t b)

which is also called traverse (in the Applicative case), though that name is less evocative here.

But this is misc so we can do as we feel, and it looks like you don't feel the same way.

@alainfrisch (Contributor) commented Mar 28, 2019

The code looks good to me (modulo tiny comment for the latest helper function). I'm still unsure about the practical usefulness of the optimization but won't object to merging since other maintainers are in favor.

Btw, while we are considering such optimizations, one could as well try to avoid arbitrary small-enough tables: for instance, a mapping i=0->a, i=1->b, i=2->c can be implemented with a formula of the form x + i * y + (i/2) * z (with x = a, y = b - a, z = c + a - 2 * b, and integer division so 1/2 = 0), perhaps a bit more efficiently than with an indirect memory read. At least such an approach would be independent of the ordering of constructors.
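The arithmetic of that closed form checks out; a quick sketch to verify it (hypothetical code, only illustrating the formula in the comment above):

```ocaml
(* For a 3-entry table i=0->a, i=1->b, i=2->c, build the closed form
   f i = x + i*y + (i/2)*z  with  x = a, y = b - a, z = c + a - 2*b.
   Integer division makes the (i/2)*z term vanish for i = 0 and i = 1
   and contribute exactly z for i = 2. *)
let table_as_formula a b c =
  let x = a and y = b - a and z = c + a - 2 * b in
  fun i -> x + i * y + (i / 2) * z
```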

@trefis (Contributor) commented Apr 9, 2019

@alainfrisch: let's perhaps open a dedicated issue to discuss that one?
@gasche: should this PR be merged then?

@alainfrisch (Contributor) commented Apr 9, 2019

@trefis I did a quick test with 3, and the performance impact was in the noise (I didn't look at cache misses, though). I don't think it is worth discussing that idea further.

@gasche gasche merged commit 5a88aeb into ocaml:trunk Apr 9, 2019
2 checks passed: continuous-integration/appveyor/pr (AppVeyor build succeeded), continuous-integration/travis-ci/pr (Travis CI build passed).
@gasche (Member) commented Apr 9, 2019

@trefis merged, thanks for the ping.
