Simplify matches that are an affine function of the input #8547
This PR converts matches that are affine functions of the input index from a table lookup to a direct computation of the output value.
Isn't the table-lookup optimization already good enough?
This optimization is more cache-efficient, because we don't need to store a table at all.
Affine function vs just optimizing identity function
This optimization triggers for all affine functions, because the additional changes compared to just triggering when the match is the identity are small.
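As a concrete illustration (my own sketch, not code from the patch): in the match below, every arm returns `2 * x + 1`, an affine function of the scrutinee, so conceptually the jump table can be replaced by the arithmetic plus a range check.

```ocaml
(* Each arm returns 2 * x + 1, an affine function of the matched value,
   so no lookup table is needed. *)
let f x =
  match x with
  | 0 -> 1
  | 1 -> 3
  | 2 -> 5
  | 3 -> 7
  | _ -> 0

(* Conceptually what the optimized code computes for in-range inputs: *)
let f_direct x = if 0 <= x && x <= 3 then (2 * x) + 1 else 0
```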
To benchmark, we simulate a large program that consists of many matches and together occupies more memory than fits in the cache.
```ocaml
let printf = Printf.printf

let num_matches = 10_000

let () =
  for i = 0 to num_matches do
    printf "let [@inline never] do_match_%i x =\n" i;
    printf " let x = Sys.opaque_identity x in\n";
    printf " match x with\n";
    for j = 0 to 200 do
      printf " | %i -> %i\n" j (j + i)
    done;
    printf " | _ -> 0\n";
    printf "\n"
  done

let () =
  printf "let () =\n";
  printf " let result = ref 0 in\n";
  printf " for i = 0 to 40_000 do\n";
  printf " let m = i mod %i in\n" (num_matches + 1);
  for i = 0 to num_matches do
    let sign = 1 + ((i mod 2) * (-2)) in
    printf " result := !result + %i * (do_match_%i m);\n" sign i
  done;
  printf " done;\n";
  printf " print_int !result"
```
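For reference, here is the shape of code the generator above emits for `i = 0`, truncated to three arms instead of the full 201 (a sketch of the output, not the verbatim generated file):

```ocaml
(* For i = 0 each arm maps j to j + 0, i.e. the identity function,
   so this match is exactly the kind the optimization targets. *)
let [@inline never] do_match_0 x =
  let x = Sys.opaque_identity x in
  match x with
  | 0 -> 0
  | 1 -> 1
  | 2 -> 2
  | _ -> 0
```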
Without the optimization:
With the optimization:
So we have around a 2% improvement.
Finally, we can examine cache behaviour:
It seems that when your optimisation triggers, you're ignoring the
An example which I think would fail:
```ocaml
type t = A | B | C | D

let f = function
  | A | D -> 0
  | B -> 1
  | C -> 2
```
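To spell out the trap (my own sketch; `is_affine` is a hypothetical helper, not code from the patch): constructors A, B, C have tags 0, 1, 2 and map to actions 0, 1, 2, which looks like the identity, but D (tag 3) also maps to 0, so the mapping as a whole is not affine. A check must validate every tag→action pair, not just the leading ones:

```ocaml
(* Checks whether tag->action pairs all satisfy a single affine
   function action = a * tag + b, inferred from the first two pairs.
   Assumes tags are distinct. *)
let is_affine = function
  | [] | [ _ ] -> true
  | (i0, v0) :: (i1, v1) :: _ as pairs ->
    let a = (v1 - v0) / (i1 - i0) in
    let b = v0 - (a * i0) in
    List.for_all (fun (i, v) -> v = (a * i) + b) pairs

let () =
  (* A, B, C alone look like the identity... *)
  assert (is_affine [ (0, 0); (1, 1); (2, 2) ]);
  (* ...but adding D (tag 3, action 0) breaks affinity. *)
  assert (not (is_affine [ (0, 0); (1, 1); (2, 2); (3, 0) ]))
```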
This should be fixable by iterating over
In an ideal world, the complete flow of the optimization and
P.S.: functions of the form
I think it is a mistake to assess optimisations such as these solely on a percentage improvement. Individual percentage improvements can be small, but with sufficiently many small optimisations, a more notable benefit emerges, one that rises above the noise in the measurements.
That said, in this particular case, the stats from perf say more about the worthiness of the optimisation than the actual time improvement, IMO.
I made an instrumented version of the compiler, and in the compiler distribution itself (without tests), we see:
On a subset of Jane Street's code base, we see:
I examined some of the matches where it triggers, and it is indeed code that's critical enough for it to matter to us.
Note that for some critical code, the optimization doesn't trigger -- because we've already replaced the match by Obj.magic (or its equivalent). This optimization allows us to stop doing that, regaining the benefit of the type checker without losing performance.
Thanks for restructuring the code!
(I haven't done a full review of the code-generation parts yet, but personally I feel good about this PR; @alainfrisch is right to ask about concrete evidence of usefulness (do not hesitate to provide these from the start in your next optimization PR), but the applicability numbers you give are enticing.)
The code looks good to me (modulo tiny comment for the latest helper function). I'm still unsure about the practical usefulness of the optimization but won't object to merging since other maintainers are in favor.
Btw, while we are considering such optimizations, one could as well try to avoid arbitrary small-enough tables (for instance, a mapping i=0->a, i=1->b, i=2->c can be implemented with a formula of the form x + i * y + (i/2) * z, where x = a, y = b - a, z = c + a - 2 * b, and integer division gives 1/2 = 0), perhaps a bit more efficiently than with an indirect memory read. At least such an approach would be independent of the ordering of constructors.
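A quick sanity check of that formula (my own sketch; `lookup3` is a hypothetical name, not anything in the compiler):

```ocaml
(* Encode the 3-entry table [a; b; c] as x + i*y + (i/2)*z, where
   integer division makes (i/2) equal 0, 0, 1 for i = 0, 1, 2. *)
let lookup3 a b c i =
  let x = a in
  let y = b - a in
  let z = c + a - (2 * b) in
  x + (i * y) + ((i / 2) * z)

let () =
  (* Table [10; 20; 35]: i=0 -> 10, i=1 -> 20, i=2 -> 35. *)
  assert (lookup3 10 20 35 0 = 10);
  assert (lookup3 10 20 35 1 = 20);
  assert (lookup3 10 20 35 2 = 35)
```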