Conversation

@vouillon
Member

This makes the main loops about twice as fast by removing one memory access on the critical path and reducing the number of instructions executed per iteration.
The downside is that we have to use Obj.magic and unsafe array/string accesses for that.

Indeed, we currently have two memory accesses on the critical path: let st = st.next.[...]. The idea is to remove the first one: let st = st.[...]. But then st no longer directly contains information about the current state of the automaton; instead it becomes an array of possible subsequent states, so we have to store the information about the current state somewhere else. This is achieved by putting it at index 0 of the array, but we need to use Obj.magic for that.
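The trick can be sketched as follows. This is a minimal illustration with hypothetical names (info, make, set_next, get_info, step), not the actual code in lib/core.ml:

```ocaml
(* A state is just its transition table: an array of states indexed by
   color, with the per-state data smuggled into slot 0 via Obj.magic. *)
type info = { idx : int }                     (* hypothetical state data *)
type state = State of state array [@@unboxed]

let make (i : info) n_colors : state =
  (* Slot 0 holds [i]; slots 1 .. n_colors hold the successor states.
     [i] is an ordinary boxed record, so Array.make builds a boxed array. *)
  State (Array.make (n_colors + 1) (Obj.magic i : state))

let set_next (State a) color succ = a.(color + 1) <- succ

let get_info (State a) : info = Obj.magic a.(0)

(* A transition is now a single memory access: *)
let step (State a) color : state = a.(color + 1)
```

With this representation, following a transition and then immediately following another one no longer chains two dependent loads per step.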

Once the critical path is shortened, it is no longer the bottleneck, so we need to reduce the number of instructions executed per iteration as well. Hence the use of unsafe array/string accesses.
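For instance, when the loop guard already guarantees the index is in range, the bounds check performed by String.get is pure overhead. A sketch (not the actual matching loop):

```ocaml
(* Count 'a' characters in a string. The [i >= n] test above the access
   guarantees [i] is in range, so the bounds check that [String.get]
   would perform is redundant and [String.unsafe_get] skips it. *)
let count_a s =
  let n = String.length s in
  let rec loop i acc =
    if i >= n then acc
    else
      let acc = if String.unsafe_get s i = 'a' then acc + 1 else acc in
      loop (i + 1) acc
  in
  loop 0 0
```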

vouillon added 2 commits March 4, 2022 01:16
These tables are not mutated.
It was optimized for the x86 architecture. Modern architectures have
plenty of registers, so we can use a simpler loop.
Collaborator

@Drup Drup left a comment


That's a nice optimisation! It really showcases what you can gain with a manually tweaked memory representation that removes some key indirections. :)

You really need to isolate the "manual memory tweak" in a submodule and document it if you want anyone else to be able to come back to this code (not that many people do ... it's already a bit daunting).

Could you add your benchmarks to the repo (and show the measurements)?

lib/core.ml Outdated

(* Transition table, indexed by color. The state information is stored
at position 0. *)
type table = Table of table array [@@unboxed]
Collaborator

I think you should isolate table and state in a module and rename them (the old state is now pretty much info, and the new table should be named state).

Also, I would suggest adding a comment explaining that you want a contiguous memory range representing state_info * table, but OCaml can't do that for you without adding an indirection, so you do it manually.

Is there any reason not to inline also idx? It is also accessed in the critical path.

Member Author

Is there any reason not to inline also idx? It is also accessed in the critical path.

It is not on the critical path, thanks to branch prediction. The processor will speculatively execute the code as if idx >= 0, so it does not have to wait for the actual value of idx.

vouillon added 2 commits June 3, 2022 15:24
When executing [let st = st.next.[...]], the initial [st] is still live,
so the compiler cannot reuse the register for the new value of [st] and
a move instruction will have to be performed. As a hack, we add another
parameter to the loop functions that contains the same value as the
initial [st] and can be used afterwards. We still have a move to set
this parameter, but it is no longer on the critical path.
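On a toy two-state automaton, the hack looks roughly like this (hypothetical st type and loop; the real loop in lib/core.ml is more involved). [last] is always passed the same value as [st], so the state entering an iteration remains available when the loop exits, without keeping a second register live across the critical memory load:

```ocaml
type st = { id : int; mutable next : st array }

(* [last] duplicates [st]. The move that sets [last] does not feed the
   next iteration's memory load, so it is off the critical path, and the
   register holding [st] can be overwritten directly by the load. *)
let rec loop last st s i =
  if i >= String.length s then last
  else
    let st' = st.next.(if s.[i] = 'a' then 0 else 1) in
    loop st' st' s (i + 1)
```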
vouillon added 2 commits June 3, 2022 18:20
We replace [let st = st.next.[...]] by [let st = st.[...]]. This removes
one memory access on the critical path, halving its length. But we now
have to use [Obj.magic] to get information about the current state...
Now that the critical path is much shorter, the bottleneck becomes the
number of instructions executed per loop iteration: bounds checks are no
longer "for free".
@vouillon
Member Author

To measure performance, I use the following code, which spends most of its time traversing a string:

```ocaml
let rec repeat n f = if n > 0 then begin f (); repeat (n - 1) f end

let () =
  let len = 1000 * 1000 in
  let s = String.make len 'a' in
  let re = Re.Pcre.regexp "aaaaaaaaaaaaaaaaz" in
  let perform () = try ignore (Re.execp re s ~pos:0) with Not_found -> () in
  ignore (repeat 1000 perform)
```

Using perf stat, I get the following number of cycles on my laptop:

|          | before         | after         |
|----------|----------------|---------------|
| Re.exec  | 11 538 887 906 | 5 811 721 144 |
| Re.execp | 11 241 994 212 | 5 585 367 747 |

So, it's about twice as fast, going from 11 cycles per character to 5.5 cycles per character.
The best we can expect is 5 cycles per character (L1 cache latency), but it seems the instruction scheduling is not optimal and the memory access gets delayed about half of the time.
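The per-character figures follow from the benchmark parameters: a string of 10^6 characters traversed 1000 times, i.e. 10^9 characters processed in total.

```ocaml
(* Cycles per character = total cycles / (string length * repetitions),
   using the benchmark parameters above (10^6 chars, 1000 repetitions). *)
let cycles_per_char total_cycles =
  total_cycles /. (1_000_000. *. 1_000.)

let () =
  Printf.printf "Re.exec: %.1f -> %.1f cycles/char\n"
    (cycles_per_char 11_538_887_906.)
    (cycles_per_char 5_811_721_144.)
```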

@DemiMarie

Instead of Obj.magic, another option would be to use Obj.field to get the first field of the array.

The best we can expect is 5 cycles per character (L1 cache latency)

I don’t think this is actually true. It is possible to go much faster by fetching larger chunks from memory or by performing memory accesses in parallel. Cryptographic code, for instance, often achieves less than 1 cycle per byte, despite doing significant computation.

rgrinberg added a commit that referenced this pull request Apr 17, 2024
rgrinberg added a commit that referenced this pull request Apr 17, 2024
@rgrinberg
Member

Rebased this in #265. Great work!

@rgrinberg rgrinberg closed this Apr 18, 2024
rgrinberg added a commit that referenced this pull request Apr 18, 2024