Conversation

@vouillon
Member

This makes the main loops about twice as fast by removing one memory access on the critical path and reducing the number of instructions executed per iteration.
The downside is that we have to use Obj.magic and unsafe array/string accesses for that.

Indeed, we currently have two memory accesses on the critical path: let st = st.next.[...]. The idea is to remove the first one: let st = st.[...]. But then st no longer directly contains information about the current state of the automaton; instead it becomes an array of possible subsequent states, so we have to store the information about the current state somewhere else. This is achieved by putting it at index 0 of the array, but we need to use Obj.magic for that.
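The trick can be sketched as follows. This is a minimal illustration with hypothetical names (info, make, set_next, get_info, step), not the actual code in lib/core.ml:

```ocaml
(* A state is just its transition table: an array of states indexed by
   color, with the per-state data smuggled into slot 0 via Obj.magic. *)
type info = { idx : int }                     (* hypothetical state data *)
type state = State of state array [@@unboxed]

let make (i : info) n_colors : state =
  (* Slot 0 holds [i]; slots 1 .. n_colors hold the successor states.
     [i] is an ordinary boxed record, so Array.make builds a boxed array. *)
  State (Array.make (n_colors + 1) (Obj.magic i : state))

let set_next (State a) color succ = a.(color + 1) <- succ

let get_info (State a) : info = Obj.magic a.(0)

(* A transition is now a single memory access: *)
let step (State a) color : state = a.(color + 1)
```

With this representation, following a transition and then immediately following another one no longer chains two dependent loads per step.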

Once the critical path is shortened, it is no longer the bottleneck, so we need to reduce the number of instructions executed per iteration as well. Hence the use of unsafe array/string accesses.
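For instance, when the loop guard already guarantees the index is in range, the bounds check performed by String.get is pure overhead. A sketch (not the actual matching loop):

```ocaml
(* Count 'a' characters in a string. The [i >= n] test above the access
   guarantees [i] is in range, so the bounds check that [String.get]
   would perform is redundant and [String.unsafe_get] skips it. *)
let count_a s =
  let n = String.length s in
  let rec loop i acc =
    if i >= n then acc
    else
      let acc = if String.unsafe_get s i = 'a' then acc + 1 else acc in
      loop (i + 1) acc
  in
  loop 0 0
```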

vouillon added 2 commits March 4, 2022 01:16
These tables are not mutated.
It was optimized for the x86 architecture. Modern architectures have
plenty of registers, so we can use a simpler loop.
Collaborator

@Drup Drup left a comment


That's a nice optimisation! It really showcases what you can gain with a manually tweaked memory representation that removes some key indirections. :)

You really need to isolate the "manual memory tweak" in a submodule and document it if you want anyone else to be able to come back to this code (not that many people do ... it's already a bit daunting).

Could you add your benchmarks to the repo (and show the measurements)?

lib/core.ml Outdated

(* Transition table, indexed by color. The state information is stored
at position 0. *)
type table = Table of table array [@@unboxed]
Collaborator

I think you should isolate table and state in a module and rename them (the old state is now pretty much info, and the new table should be named state).

Also, I would suggest adding a comment explaining that you want a contiguous memory range representing state_info * table, but OCaml can't do that for you without adding an indirection, so you do it manually.

Is there any reason not to inline also idx? It is also accessed in the critical path.

Member Author

Is there any reason not to inline also idx? It is also accessed in the critical path.

It is not on the critical path, thanks to branch prediction. The processor will speculatively execute the code as if idx >= 0, so it does not have to wait for the actual value of idx.

vouillon added 2 commits June 3, 2022 15:24
When executing [let st = st.next.[...]], the initial [st] is still live,
so the compiler cannot reuse the register for the new value of [st] and
a move instruction will have to be performed. As a hack, we add another
parameter to the loop functions that contains the same value as the
initial [st] and can be used afterwards. We still have a move to set
this parameter, but it is no longer on the critical path.
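On a toy two-state automaton, the hack looks roughly like this (hypothetical st type and loop; the real loop in lib/core.ml is more involved). [last] is always passed the same value as [st], so the state entering an iteration remains available when the loop exits, without keeping a second register live across the critical memory load:

```ocaml
type st = { id : int; mutable next : st array }

(* [last] duplicates [st]. The move that sets [last] does not feed the
   next iteration's memory load, so it is off the critical path, and the
   register holding [st] can be overwritten directly by the load. *)
let rec loop last st s i =
  if i >= String.length s then last
  else
    let st' = st.next.(if s.[i] = 'a' then 0 else 1) in
    loop st' st' s (i + 1)
```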
vouillon added 2 commits June 3, 2022 18:20
We replace [let st = st.next.[...]] by [let st = st.[...]]. This removes
one memory access on the critical path, halving its length. But we now
have to use [Obj.magic] to get information about the current state...
Now that the critical path is much shorter, the bottleneck becomes the
number of instructions executed per loop iteration: bounds checks are no
longer "for free".
@vouillon
Member Author

To measure performance, I use the following code, which spends most of its time traversing a string:

```ocaml
let rec repeat n f = if n > 0 then begin f (); repeat (n - 1) f end

let () =
  let len = 1000 * 1000 in
  let s = String.make len 'a' in
  let re = Re.Pcre.regexp "aaaaaaaaaaaaaaaaz" in
  let perform () = try ignore (Re.execp re s ~pos:0) with Not_found -> () in
  ignore (repeat 1000 perform)
```

Using perf stat, I get the following number of cycles on my laptop:

|          | before         | after         |
|----------|----------------|---------------|
| Re.exec  | 11 538 887 906 | 5 811 721 144 |
| Re.execp | 11 241 994 212 | 5 585 367 747 |

So, it's about twice as fast, going from 11 cycles per character to 5.5 cycles per character.
The best we can expect is 5 cycles per character (L1 cache latency), but it seems the instruction scheduling is not optimal and the memory access gets delayed about half of the time.
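The per-character figures follow from the benchmark parameters: a string of 10^6 characters traversed 1000 times, i.e. 10^9 characters processed in total.

```ocaml
(* Cycles per character = total cycles / (string length * repetitions),
   using the benchmark parameters above (10^6 chars, 1000 repetitions). *)
let cycles_per_char total_cycles =
  total_cycles /. (1_000_000. *. 1_000.)

let () =
  Printf.printf "Re.exec: %.1f -> %.1f cycles/char\n"
    (cycles_per_char 11_538_887_906.)
    (cycles_per_char 5_811_721_144.)
```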

@DemiMarie

Instead of Obj.magic, another option would be to use Obj.field to get the first field of the array.

The best we can expect is 5 cycles per character (L1 cache latency)

I don’t think this is actually true. It is possible to go much faster by fetching larger chunks from memory or by performing memory accesses in parallel. Cryptographic code, for instance, often achieves less than 1 cycle per byte, despite doing significant computation.

rgrinberg added a commit that referenced this pull request Apr 17, 2024
rgrinberg added a commit that referenced this pull request Apr 17, 2024
@rgrinberg
Member

Rebased this in #265. Great work!

@rgrinberg rgrinberg closed this Apr 18, 2024
rgrinberg added a commit that referenced this pull request Apr 18, 2024