Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnecessary calls to caml_modify #6019

Closed
vicuna opened this issue May 23, 2013 · 11 comments
Closed

Unnecessary calls to caml_modify #6019

vicuna opened this issue May 23, 2013 · 11 comments

Comments

@vicuna
Copy link

@vicuna vicuna commented May 23, 2013

Original bug ID: 6019
Reporter: bnigito
Status: resolved (set by @xavierleroy on 2013-06-01T07:59:09Z)
Resolution: suspended
Priority: normal
Severity: tweak
Version: 4.00.1
Category: back end (clambda to assembly)
Monitored by: @gasche @diml @ygrek bobot @jmeber yminsky @yakobowski @alainfrisch

Bug description

For some mutation heavy code, the c-call to caml_modify is extremely high on our profiles (exhibiting almost a full order of magnitude difference in tight loops). This is in some sense a duplicate of 0005592, but it highlights the fact that writes could be made cheaper even absent the inlining improvements by a emitting a dynamic check that would skip the C-call to caml_modify.

We've added something like this:

let unsafe_set t i obj =
let old_obj = unsafe_get t i in
if Obj.is_int old_obj && Obj.is_int obj then
(* It is OK to skip [caml_modify] if both the old and new values are integers. *)
Array.unsafe_set (Obj.magic (t : t) : int array) i (Obj.obj obj : int)
else if not (old_obj == obj) then
Array.unsafe_set t i obj

around wrappers of arrays that often hold immediate values with good results. It seems like it makes more sense to perform at the level of code generation. For the included test I've demonstrated it for a polymorphic mutable field, but doing this sort of dynamic check in our code feels somewhat impractical.

Steps to reproduce

$ ocamlopt.opt -inline 20 -nodynlink -g -o z z.ml
$ ./z
int field took 0.120980
poly field took 1.021845
poly field with dynamic check took 0.240963

File attachments

@vicuna
Copy link
Author

@vicuna vicuna commented May 24, 2013

Comment author: @xavierleroy

Yes, caml_modify is too slow, and we should do something about this. Thanks for raising the issue on the bug tracker.

Before inlining a fast path, as you do, I think we need to rewrite caml_modify so that fast paths (there are several possible) are clearly identified. I suspect most of the cost of caml_modify comes primarily from calls to Is_in_heap, which can be avoided in several cases, and not from the call to caml_modify instead.

To be investigated...

@vicuna
Copy link
Author

@vicuna vicuna commented May 24, 2013

Comment author: @gasche

For the record, I modified the proposed example as such to test the performance hit on non-integer modifications.

time_it "addr field" (fun () ->
let t = Poly_field.create [] in
for i = 1 to times do
Poly_field.set t [i]
done);

time_it "addr field with dynamic check" (fun () ->
let t = Poly_field.create [] in
for i = 1 to times do
Poly_field.set_dynamic t [i]
done);

The results on my amd64 machine are the following:

int field took 0.116000
int poly field took 0.940000
int field with dynamic check took 0.468000
addr field took 1.680000
addr field with dynamic check took 2.244000

so this particular microbenchmark must be relativized in that it may speed up integer modifications by 50%, but also slow down the other modifications by 30%. It probably wouldn't be appropriate to blindly inject the corresponding code in all modification places.

@vicuna
Copy link
Author

@vicuna vicuna commented May 27, 2013

Comment author: bnigito

Would you mind running that benchmark again ensuring the resulting set_dynamic code is inlined? I believe that is the only reason you would see a difference of such magnitude (our standard compilation tools already have aggressive inlining options enabled, so I sometimes assume it).

Here's what I see:
addr field took 1.872715
addr field (no-inlining) with dynamic check took 2.492622
addr field (inlined) with dynamic check took 1.929706

Certainly it is a tradeoff, not in all modification sites but only those with a polymorphic value (if this is known/possible at compilation). If it were to slow down polymorphic modifications by 30% I would not have suggested it, but if it's in the range of 3-5%, I think there is a case to consider it.

I also believe the microbenchmark might be understating the benefit, as the an unnecessary caml_modify call could also be evicting useful cache lines, etc.

thanks for looking into it.

@vicuna
Copy link
Author

@vicuna vicuna commented Jun 1, 2013

Comment author: @xavierleroy

The SVN trunk (r13723) reimplements caml_modify() to reduce overheads and take advantage of several fast paths. On the microbenchmark of this PR, the performance improvement is significant:

                             old implem  new implem

int field took 0.060003 0.056003
poly field took 0.872055 0.568036 <--
poly field with dynamic check took 0.412026 0.412025
addr field took 1.292080 0.888056 <--
addr field with dynamic check took 1.704107 1.340084

In particular, the performance improvement resulting from open-coding the dynamic check is much more modest. Given that the dynamic check is not always a win, I'd rather not pursue this direction.

@vicuna vicuna closed this Jun 1, 2013
@vicuna
Copy link
Author

@vicuna vicuna commented Jun 1, 2013

Comment author: yminsky

When trying to build the latest trunk on MacOSX, I get the following error. I'm guessing this is related to the present changes.

rm minor_gc.d.c
ln -s -f memory.c memory.d.c
gcc -c -DCAML_NAME_SPACE -g -DDEBUG -fno-defer-pop -Wall -D_FILE_OFFSET_BITS=64 -D_REENTRANT memory.d.c

stderr

memory.d.c: In function ‘caml_modify’:
memory.d.c:541: error: invalid operands to binary & (have ‘value *’ and ‘int’)
memory.d.c:541: warning: left-hand operand of comma expression has no effect
make[2]: *** [memory.d.o] Error 1
make[1]: *** [coldstart] Error 2
make: *** [world] Error 2

@vicuna
Copy link
Author

@vicuna vicuna commented Jun 1, 2013

Comment author: yminsky

Hmm. I tried to replicate this outside of OPAM, but the checkout of the specific revision builds without incident, and the current tip breaks in a different place (scanf related). When tip settles down, I'll try again.

@vicuna
Copy link
Author

@vicuna vicuna commented Jun 1, 2013

Comment author: @xavierleroy

Indeed there was a missing cast, but it only shows up when building the runtime system in debug mode. Fixed in commit r13727.

@vicuna
Copy link
Author

@vicuna vicuna commented Jun 1, 2013

Comment author: yminsky

I suppose this bug is unrelated?

stderr

File "scanf.ml", line 640, characters 14-75:
Error: This expression has type string -> string
but an expression was expected of type string
make[2]: *** [scanf.cmo] Error 2
make[1]: *** [coldstart] Error 2
make: *** [world] Error 2

@vicuna
Copy link
Author

@vicuna vicuna commented Jun 1, 2013

Comment author: @xavierleroy

Yes it is unrelated. Just look at the committers.

@vicuna
Copy link
Author

@vicuna vicuna commented Jun 6, 2013

Comment author: yminsky

I've tested out the new version, and it's clearly and significantly better, but I think an update of Brian's original benchmark shows that there's still significant room to improve. In particular, the dynamic check still seems like a real win. Here's the benchmark results I see:

First, with the old compiler:

                           int field took: 0.110356
                 poly field with int took: 1.184308
                 poly field with obj took: 1.398438
  poly field with int, dynamic check took: 0.253358
  poly field with obj, dynamic check took: 1.386732

And with the new compiler:

                           int field took: 0.113187
                 poly field with int took: 0.627932
                 poly field with obj took: 0.874185
  poly field with int, dynamic check took: 0.253959
  poly field with obj, dynamic check took: 0.868465

In both cases, it looks to me like the dynamic check wins, and by a decent margin. The fact that the poly-field version with the dynamic check does better by a bit I suspect is just some noise, more indicating that the two are very close than that there's a real advantage. That said, for the int, the poly check seems to be more than 3 times better. I suspect this win comes from avoiding the C call to caml_modify.

My benchmark code is below:

module Int_field = struct
type t = { mutable x : int; }
let create x = { x }
let set t x = t.x <- x
end

module Poly_field = struct
type 'a t = { mutable x : 'a; }
let create x = { x }
let set t x = t.x <- x

let set_dynamic t x =
let old_obj = Obj.repr t.x in
let new_obj = Obj.repr x in
if Obj.is_int old_obj && Obj.is_int new_obj then
Array.unsafe_set (Obj.magic t : int array) 0 (Obj.magic x : int)
else if not (old_obj == new_obj) then
t.x <- x
;;
end

type obj = Obj of int

let time_it desc f =
let start = Sys.time () in
f ();
let elapsed = (Sys.time ()) -. start in
Printf.printf "%40s took: %6f%!\n" desc elapsed
;;

let obj0 = Obj 0
let obj1 = Obj 1

let () =

let times = 200_000_000 in
time_it "int field" (fun () ->
let t = Int_field.create 0 in
for i = 1 to times do
Int_field.set t i;
done);

time_it "poly field with int" (fun () ->
let t = Poly_field.create 0 in
for i = 1 to times do
Poly_field.set t i;
done);

time_it "poly field with obj" (fun () ->
let t = Poly_field.create (Obj 0) in
for i = 1 to times do
Poly_field.set t (if i mod 2 = 0 then obj0 else obj1)
done);

time_it "poly field with int, dynamic check" (fun () ->
let t = Poly_field.create 0 in
for i = 1 to times do
Poly_field.set_dynamic t i;
done);

time_it "poly field with obj, dynamic check" (fun () ->
let t = Poly_field.create (Obj 0) in
for i = 1 to times do
Poly_field.set t (if i mod 2 = 0 then obj0 else obj1)
done);
;;

@vicuna
Copy link
Author

@vicuna vicuna commented Mar 5, 2015

Comment author: @damiendoligez

Note to future visitors: there's a bug in Yaron's latest benchmark: it runs the exact same code for cases 3 and 5, which explains the very close performance numbers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant