Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing Z.of_int as %identity in Zarith (was: zarith sample code segfaults) #6501

Closed
vicuna opened this issue Jul 28, 2014 · 11 comments

Comments

Projects
None yet
1 participant
@vicuna
Copy link

commented Jul 28, 2014

Original bug ID: 6501
Reporter: Antoine Mine
Status: closed (set by @xavierleroy on 2016-12-07T10:34:27Z)
Resolution: fixed
Priority: normal
Severity: feature
Platform: x86_64
OS: Linux
Version: 4.02.0+beta1 / +rc1
Target version: 4.02.1+dev
Fixed in version: 4.03.0+dev / +beta1
Category: back end (clambda to assembly)
Monitored by: avaron yminsky @yakobowski

Bug description

This is an upstream from Zarith's bug #1409
https://forge.ocamlcore.org/tracker/index.php?func=detail&aid=1409&group_id=243&atid=1095

The client code bug.ml (in Zarith's bug report) uses the Z module compiled by Zarith into zarith.cmxa. When compiling bug.ml with ocamlopt and the z.cmx file from the compilation of Zarith is present, the code segfaults in the GC:

#0 0x000000000044fc28 in invert_pointer_at ()
#1 0x000000000044fcb3 in invert_root ()
#2 0x0000000000441ef0 in caml_do_roots ()
#3 0x000000000044fd63 in do_compaction ()
#4 0x00000000004501e8 in caml_compact_heap ()
#5 0x000000000045056d in caml_compact_heap_maybe ()
#6 0x0000000000443507 in caml_major_collection_slice ()
#7 0x0000000000443c71 in caml_minor_collection ()
#8 0x00000000004429f5 in caml_garbage_collection ()
#9 0x000000000045179c in caml_call_gc ()
#10 0x0000000000432147 in camlHashtbl__mem_1128 () at hashtbl.ml:181
#11 0x000000000041054d in camlBug__fun_1093 ()
#12 0x000000000041068b in camlBug__count_1066 ()
#13 0x000000000041057f in camlBug__fun_1093 ()
#14 0x000000000041057f in camlBug__fun_1093 ()
#15 0x00000000004105c6 in camlBug__count_1057 ()
#16 0x000000000041057f in camlBug__fun_1093 ()
#17 0x0000000000410ae5 in camlBug__entry ()
#18 0x000000000040e689 in caml_program ()
#19 0x000000000045197e in caml_start_program ()
#20 0x0000000000000000 in ?? ()

Well, it might be a problem in Zarith, which does magic mixing of OCaml, C and asm.
But now for the weird part. If the z.cmx file is not present when bug.ml is compiled, the program does not segfault. This is always reproducible on my machine. I've checked that the code generated by ocamlopt in both cases is actually quite different.

Steps to reproduce

  1. Download zarith and unpack it
    https://forge.ocamlcore.org/frs/download.php/1199/zarith-1.2.1.tgz

  2. ./configure and make it

  3. download bug.ml and copy it in zarith source
    https://forge.ocamlcore.org/tracker/download.php/243/1095/1409/261/bug.ml

  4. check that z.cmx is in the working directory as well as bug.ml

  5. compile with "ocamlopt zarith.cmxa bug.ml -cclib -L."

  6. ./a.out segfaults

  7. remove z.cmx

  8. recompile with "ocamlopt zarith.cmxa bug.ml -cclib -L."

  9. ./a.out now works

@vicuna

This comment has been minimized.

Copy link
Author

commented Jul 29, 2014

Comment author: @diml

When z.cmx is present functions from the Z module might be inlined.

I diffed the two clambda and there are actually very few differences. Only [Z.zero] and [Z.of_int] are inlined.

@vicuna

This comment has been minimized.

Copy link
Author

commented Jul 29, 2014

Comment author: @yakobowski

The bug disappears (and the code becomes faster :-)) when the following definition is used for memoize

let memoize f =
let cache = Hashtbl.create 100 in
fun n ->
try Hashtbl.find cache n
with Not_found ->
let x = f n in
Hashtbl.add cache n x;
x

However, the real problem probably lies in the loop

  let total = ref Z.zero in
  for fst_size = 0 to size do
    let snd_size = size - fst_size in
    let fst_count = fst_count fst_size in
    let snd_count = snd_count snd_size in
    let count = Z.( * ) fst_count snd_count in
    total := Z.(+) !total count
  done;
  !total

If total is instead initialized to Z.of_int 1, and the code returns 'Z.sub !total (Z.of_int 1)' instead, the segfault disappears. It is likely that 'total' changes from OCaml integers to Zarith internal representation at some point, which confuses either the code generator or the GC.

EDIT: printing the linear code with initial value 0 and 1 shows that very little changes between the two versions. Either the problem is in the GC, or this is the wrong track.

@vicuna

This comment has been minimized.

Copy link
Author

commented Jul 29, 2014

Comment author: Antoine Mine

Thanks for the bug analysis. Indeed, Z.of_int (and co.) seems to be the culprit. It is defined as %identity, which was probably a bad idea. Redefining it as a C external "ml_z_of_int" solves the issue. I think I will go with this solution and patch Zarith.

Nevertheless, I wonder if there is some relationship with Lazy.t as: 1) bug.ml is the first program I've seen that mixes Lazy and Zarith, 2) I've never had problems with Zarith for a long long time, 3) both do colorful magic mixing values and blocks.

A wild guess is that OCaml performs type-based optimizations (or, more precise, using the abstraction "is a scalar", "is a block", "is polymorphic") and uses the knowledge that a mutable object's initial value is the result of calling %identity on an integer. However, the type of Z.of_int is "int -> t" where t is abstract, which may (should?) prevent type propagation.

@vicuna

This comment has been minimized.

Copy link
Author

commented Jul 30, 2014

Comment author: @yakobowski

My remark about replacing 'ref Z.zero' by 'ref (Z.of_int 1)' may actually be a red herring. When doing so, the arguments to some GC functions (prior to the crash) do change in a significant way. It is theoretically possible that the switch from OCaml short integers to Zarith long integers occur earlier because of this. This would change the allocation profile of the code.

@vicuna

This comment has been minimized.

Copy link
Author

commented Jul 30, 2014

Comment author: @diml

The typer doesn't really cares about the fact that "Z.of_int" is declared as an external. This is used by later phases in the compiler, so I don't think that's the problem. The assembly code has very few differences between the inlined and non-inlined versions anyway.

Lazy shouldn't be a problem here. I tried replacing the lazy by a ref and the bug is still there.

I played a bit with the configuration in caml_z.c, setting Z_PERFORM_CHECK to 1 produces a different error:

ml_z_check: wrong custom block for arg2 at ml_z_mul:1339

also reducing the size of the minor heap make the bug happen much faster, for instance with OCAMLRUNPARAM=s=256. The bug always happen some time after the first minor collection, so something might get corrupted during the promotion.

@vicuna

This comment has been minimized.

Copy link
Author

commented Jul 30, 2014

Comment author: @diml

I spoke a bit too fasst. Looking at -dlive total is indeed considered as always an integer and so ignored by the gc. ocamlobjinfo shows that [Z.zero] is an integer constant and not a pointer constant which is probably the reason.

@vicuna

This comment has been minimized.

Copy link
Author

commented Jul 30, 2014

Comment author: @xavierleroy

I confirm that the problem is a missing GC root. A bignum gets allocated, then freed by a minor collection, then used again as argument to Z.mul. The typing issue mentioned by "dim" is very likely the culprit. This could be a misuse of "%identity" as mentioned by Antoine, or something else. I'll investigate more.

@vicuna

This comment has been minimized.

Copy link
Author

commented Jul 30, 2014

Comment author: @xavierleroy

Indeed, the problem lies with "%identity" when it is used to inject integers into a type that can also contain heap pointers. A similar issue can be observed with Obj.magic, which is also "%identity":

let f n =
let r = ref (Obj.magic 0) in (* instead of "ref []" )
for i = 1 to n do
r := String.create 200 :: !r;
ignore (List.length !r) (
scan the list to see if it is consistent *)
done;
()

let _ = f 10000

This segfaults shortly after the first GC, because r is not a root, Why is it not a root? Because it's a local mutable variable (after let-ref optimization) that is initialized with an integer (a C-- Cconst_int, not a C-- Cconst_pointer). So, it has machine type "int" and not "address", and is excluded from the roots given to the GC.

The safe, short-term fix is to stop implementing "Z.of_int" as "%identity" and use a C function instead, as Antoine suggested.

Additionally, Zarith could define

type t' = Zero | One | Other of Obj.t

let zero : t = Obj.magic Zero
let one : t = Obj.magic One

so that zero, one would be constants of machine type address. But it doesn't work for minus_one nor for of_int in general.

In the longer term, there might be better solutions with some changes to the ocamlopt compiler. One would be a primitive like "%identity" that forces its result to have machine type "address". Unlike "%identity", it would have to be kept by the front-end all the way to C--. Another solution would be to treat mutable let-bound variables differently during C-- to Mach compilation, so that their machine type is inferred not just from their initial value but from all their assignments as well.

@vicuna

This comment has been minimized.

Copy link
Author

commented Jul 30, 2014

Comment author: @xavierleroy

I reclassify this PR as a feature wish so that it doesn't block the release of OCaml 4.02.

@vicuna

This comment has been minimized.

Copy link
Author

commented Jul 30, 2014

Comment author: Antoine Mine

Thank you very much for your help and the clear explanation. I've patched Zarith to use a C external instead of %identity for of_int. It works well.

@vicuna

This comment has been minimized.

Copy link
Author

commented Aug 11, 2014

Comment author: @xavierleroy

Tentative fix in SVN trunk, commit r15078. The machine type of a C-- mutable variable is now the lub of the types of its initial value and all the values assigned to it, with type Int being a subtype of Addr. In the problematic examples from this PR, it causes the mutable variables to have type Addr and be treated as GC roots (correctly).

@vicuna vicuna closed this Dec 7, 2016

@vicuna vicuna added the back-end label Mar 14, 2019

@vicuna vicuna added this to the 4.02.1 milestone Mar 14, 2019

@vicuna vicuna added the feature-wish label Mar 20, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.