Performance improvement #210
Conversation
These tables are not mutated.
It was optimized for the x86 architecture. Modern architectures have plenty of registers, so we can use a simpler loop.
Drup left a comment
That's a nice optimisation! It really showcases what you can get from a manually tweaked memory representation that removes some key indirections. :)

You really need to isolate the "manual memory tweak" in a submodule and document it if you want anyone else to be able to come back to this code (not that many people do ... it's already a bit daunting).

Could you add your benchmarks to the repo (and show the measurements)?
lib/core.ml
Outdated

```ocaml
(* Transition table, indexed by color. The state information is stored
   at position 0. *)
type table = Table of table array [@@unboxed]
```
I think you should isolate `table` and `state` in a module and rename them (the old `state` is now pretty much `info`, and the new `table` should be named `state`).

Also, I would suggest adding a comment that explains that you want a contiguous memory range representing `state_info * table`, but OCaml can't do that for you without adding an indirection, so you do it manually.

Is there any reason not to also inline `idx`? It is also accessed in the critical path.
> Is there any reason not to also inline `idx`? It is also accessed in the critical path.
It is not on the critical path, thanks to branch prediction. The processor will speculatively execute the code as if `idx >= 0`, so it does not have to wait for the actual value of `idx`.
When executing `let st = st.next.[...]`, the initial `st` is still live, so the compiler cannot reuse its register for the new value of `st`, and a move instruction has to be performed. As a hack, we add another parameter to the loop functions that contains the same value as the initial `st` and can be used afterwards. We still have a move to set this parameter, but it is no longer on the critical path.
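The hack can be sketched as follows, with a toy `step` function standing in for the real transition load (the names are made up, not the PR's code). The duplicated parameter `st0` always equals `st` on entry, so the body reads the pre-transition state through `st0` and the register holding `st` is free to be overwritten as soon as the new state is loaded:

```ocaml
(* Toy automaton: states are ints, [step] stands in for the transition
   load. [st0] duplicates the value of [st] at entry; everything except
   the transition itself reads [st0], so [st] is dead right after [step]. *)
let step st c = (st + Char.code c) land 0xff

let rec loop st0 st s i len last_seen =
  if i < len then begin
    let st' = step st (String.unsafe_get s i) in
    (* uses of the pre-transition state go through [st0], not [st] *)
    let last_seen = if st0 = 0 then i else last_seen in
    (* the caller performs the move [st0 <- st'], off the critical path *)
    loop st' st' s (i + 1) len last_seen
  end else last_seen

(* Record the last position at which the automaton was in state 0. *)
let scan s = loop 0 0 s 0 (String.length s) (-1)
```

Semantically `st0` and `st` are the same value; the duplication only changes which register the generated code keeps alive across the load.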
We replace `let st = st.next.[...]` by `let st = st.[...]`. This removes one memory access from the critical path, halving its length. But we now have to use `Obj.magic` to get information about the current state...
Now that the critical path is much shorter, the bottleneck becomes the number of instructions executed per loop iteration: bound checks are no longer "for free".
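As a sketch of the bound-check point (not the actual Re loop): once the guard `i < len` establishes that the access is in range, `String.unsafe_get` avoids repeating the check that `s.[i]` would perform on every iteration.

```ocaml
(* Count occurrences of [c] in [s]. The [i < len] test already guarantees
   the access is in range, so the bound check hidden in [s.[i]] is pure
   overhead; [String.unsafe_get] skips it. *)
let count c s =
  let len = String.length s in
  let rec go i acc =
    if i < len then
      go (i + 1) (if String.unsafe_get s i = c then acc + 1 else acc)
    else acc
  in
  go 0 0
```

The safety argument lives entirely in the `i < len` guard, which is why such loops deserve a comment in the real code.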
To measure performance, I use the following code, which spends most of its time traversing a string:

```ocaml
let rec repeat n f = if n > 0 then begin f (); repeat (n - 1) f end

let () =
  let len = 1000 * 1000 in
  let s = String.make len 'a' in
  let re = Re.Pcre.regexp "aaaaaaaaaaaaaaaaz" in
  let perform () = try ignore (Re.execp re s ~pos:0) with Not_found -> () in
  ignore (repeat 1000 perform)
```
So, it's about twice as fast, going from 11 cycles per character to 5.5 cycles per character.
I don’t think this is actually true. It is possible to go much faster by fetching larger chunks from memory or by performing memory accesses in parallel. Cryptographic code, for instance, often achieves less than 1 cycle per byte, despite doing significant computation.
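As an illustration of the chunked approach (a sketch under assumptions: a 64-bit platform and OCaml ≥ 4.08 for `Bytes.get_int64_le`; none of this is in the PR), here is a scan that tests 8 bytes per load using the classic SWAR zero-byte mask:

```ocaml
let ones  = 0x0101010101010101L
let highs = 0x8080808080808080L

(* True if some lane of the 64-bit word [w] equals the byte [b]:
   xor with [b] replicated in every lane turns matches into zero lanes,
   and [(x - ones) land (lognot x) land highs] flags zero lanes. *)
let word_has_byte w b =
  let x = Int64.logxor w (Int64.mul ones (Int64.of_int b)) in
  Int64.logand (Int64.logand (Int64.sub x ones) (Int64.lognot x)) highs
  <> 0L

(* Scan [s] for the byte [b], 8 bytes per memory access, with a scalar
   tail for the last few bytes. *)
let contains_byte s b =
  let bs = Bytes.unsafe_of_string s in
  let n = String.length s in
  let i = ref 0 and found = ref false in
  while not !found && !i + 8 <= n do
    if word_has_byte (Bytes.get_int64_le bs !i) b then found := true
    else i := !i + 8
  done;
  while not !found && !i < n do
    if Char.code s.[!i] = b then found := true else incr i
  done;
  !found
```

This only answers "is the byte present", not where; turning the flagged mask into a position (and running a DFA this way) takes more work, but it shows why one byte per load is not a hard floor.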
Signed-off-by: Rudi Grinberg <me@rgrinberg.com>
Rebased this in #265. Great work!
This makes the main loops about twice as fast by removing one memory access from the critical path and reducing the number of instructions executed per iteration. The downside is that we have to use `Obj.magic` and unsafe array/string accesses for that.

Indeed, we currently have two memory accesses on the critical path: `let st = st.next.[...]`. The idea is to remove the first memory access: `let st = st.[...]`. But then `st` no longer directly contains information on the current state of the automaton; instead it becomes an array of possible subsequent states, and we have to find a way to get information about the current state somehow. This is achieved by putting this information at index 0 of the array, but we need to use `Obj.magic` for that.

Once the critical path is shortened, it is no longer a bottleneck, and we need to reduce the number of instructions executed per iteration as well. Hence the use of unsafe array/string accesses.