Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only bytecode version of executable allocating huge amounts #8699

Open
ojno opened this issue May 29, 2019 · 11 comments

Comments

@ojno
Copy link

commented May 29, 2019

We have an OCaml project ( https://github.com/rems-project/sail ) which tries to allocate huge amounts of memory, but only when compiled to bytecode. The native version works perfectly fine.

strace for an example run ends in:

openat(AT_FDCWD, "/home/ojno/work/sail2/lib/vector_dec.sail", O_RDONLY|O_CLOEXEC) = 3
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "$ifndef _VECTOR_DEC\n$define _VEC"..., 65536) = 7031
mmap(NULL, 1965819695104, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x5729f32c4000)                     = 0x55603f2f4000
mmap(NULL, 1965819826176, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "Fatal error: out of memory.\n", 28Fatal error: out of memory.
) = 28
exit_group(2)                           = ?
+++ exited with 2 +++

which naturally fails since I don't have 2 TB of memory available.

The stacktrace of the failing mmap of ocamlrun looks like

Breakpoint 1, __GI___mmap64 (addr=addr@entry=0x0, len=1979118821376, 
    prot=prot@entry=3, flags=flags@entry=34, fd=fd@entry=-1, 
    offset=offset@entry=0) at ../sysdeps/unix/sysv/linux/mmap64.c:44
44	in ../sysdeps/unix/sysv/linux/mmap64.c
(gdb) bt
#0  __GI___mmap64 (addr=addr@entry=0x0, len=1979118821376, prot=prot@entry=3, 
    flags=flags@entry=34, fd=fd@entry=-1, offset=offset@entry=0)
    at ../sysdeps/unix/sysv/linux/mmap64.c:44
#1  0x00007ffff708d16b in sysmalloc (nb=nb@entry=1979118686256, 
    av=av@entry=0x7ffff73e4c40 <main_arena>) at malloc.c:2522
#2  0x00007ffff708dff0 in _int_malloc (
    av=av@entry=0x7ffff73e4c40 <main_arena>, bytes=bytes@entry=1979118686240)
    at malloc.c:4125
#3  0x00007ffff70900fc in __GI___libc_malloc (bytes=1979118686240)
    at malloc.c:3057
#4  0x000055555556978a in caml_stat_alloc_aligned_noexc (
    b=<synthetic pointer>, modulo=32, sz=1979118682144) at memory.c:789
#5  caml_alloc_for_heap (request=1979118682112) at memory.c:283
#6  0x0000555555569d9c in expand_heap (request=request@entry=137438797180)
    at memory.c:384
#7  0x0000555555569f98 in caml_alloc_shr_aux (profinfo=0, raise_oom=1, 
    tag=<optimised out>, wosize=137438797180, wosize@entry=124) at memory.c:494
#8  caml_alloc_shr (wosize=wosize@entry=137438797180, tag=<optimised out>)
    at memory.c:575
#9  0x00005555555687cc in caml_oldify_one (v=v@entry=140737337077992, 
    p=p@entry=0x7ffff6d4d508) at minor_gc.c:197
#10 0x00005555555688aa in caml_oldify_one (v=140737337078024, 
    p=p@entry=0x7ffff6d4d508) at minor_gc.c:221
#11 0x000055555556899b in caml_oldify_mopup () at minor_gc.c:304
#12 0x0000555555568b48 in caml_empty_minor_heap () at minor_gc.c:356
#13 0x000055555556904b in caml_gc_dispatch () at minor_gc.c:443
#14 0x0000555555583a8a in caml_interprete (prog=0x7ffff6637010, 
    prog_size=<optimised out>) at interp.c:534
#15 0x0000555555585952 in caml_main (argv=0x7fffffffdd18) at startup.c:419
#16 0x000055555556671c in main (argc=<optimised out>, argv=<optimised out>)
    at main.c:44

so it looks like it's a minor GC which triggers it.

The exact amount of the allocation varies from input to input and even run to run, but is always that approximate size. There's no characteristic pattern of heap or stack running out of control if I turn on memory debugging info in $OCAMLRUNPARAM, just that one huge allocation. When run in ocamldebug, it also doesn't appear to be failing at a location which is sensible to be allocating such amounts of memory -- and the location also varies with the input. And on tiny inputs it doesn't seem to try and make the allocation at all.

I've attached our test case. It can be run simply as ./sail.byte repro.sail. Do let me know if there's any other information which would be helpful.

sail_repro.zip

@xavierleroy

This comment has been minimized.

Copy link
Contributor

commented May 29, 2019

Thanks for the repro. For anyone who wants to play with it:

  • this is an OCaml 4.06.1 bytecode executable
  • it is dynamically linked with "zarith" and with "num"
  • unless your home directory is identical to that of the reporter, the way to run it is ocamlrun ./sail.byte repro.sail.

This said, I can't reproduce:

$ ocamlrun ./sail.byte repro.sail 
Error:
Error when calling smt: Reporting.Fatal_error(_)
@xavierleroy

This comment has been minimized.

Copy link
Contributor

commented May 29, 2019

The report is consistent with a bad pointer into the minor heap. The minor GC follows the pointer, reads a nonsensical block header with a nonsensical size, and tries to allocate a block of this size in the major heap.

That the problem is seen in bytecode but not in native code can be explained by GC that occurs at different times, and/or GC that follows fewer roots in native code than in bytecode.

Now, where does the bad pointer come from? One possibility is Obj.magic in the source code.

--- Is there any Obj.magic in the source code?

Another possibility is a bug in a C/OCaml binding, e.g. zarith or num.

--- Which versions of zarith and num are you using?
--- Could you upgrade to the latest OCaml (4.06.1 is getting old) and latest zarith and num, then try again?

@ivg

This comment has been minimized.

Copy link

commented May 29, 2019

Installing Z3 will make it reproducible (e.g., on Ubuntu sudo apt-get install z3),

strace ocamlrun ./sail.byte repro.sail
... <snip> ...
read(3, "unsat\n", 65536)               = 6
close(3)                                = 0
close(6)                                = 0
close(7)                                = 0
wait4(20361, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 20361
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=20361, si_uid=1001, si_status=0, si_utime=0, si_stime=0} ---
unlink("/tmp/constraint_5aaf06.smt2")   = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 32000}, ru_stime={0, 12000}, ...}) = 0
mmap(NULL, 1976731639808, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x1cc3ef37000)                      = 0x8f6000
mmap(NULL, 1976731770880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fd867f6a000
munmap(0x7fd867f6a000, 614400)          = 0
munmap(0x7fd86c000000, 66494464)        = 0
mprotect(0x7fd868000000, 135168, PROT_READ|PROT_WRITE) = 0
mmap(NULL, 1976731639808, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "Fatal error: out of memory.\n", 28Fatal error: out of memory.
) = 28
exit_group(2)                           = ?
@stedolan

This comment has been minimized.

Copy link
Contributor

commented Jun 3, 2019

I dug into this a bit today. This is in fact an OCaml bug to do with Infix_tag and recursive values, and quite an old one at that: a slight variant of it segfaults on every version of OCaml I've tried, from 3.12.1 up to trunk.

In the reproduction case above, the minor GC explodes after following a pointer that was not initialised correctly by caml_update_dummy. In this example, caml_update_dummy(dummy, newval) is passed a newval whose tag is Infix_tag and whose length is therefore not the length of the object being initialised (as the length field of Infix_tag objects is actually used to store an offset to the real object header).

It turns out that a slight variant of this crashes on both bytecode and native code. Here's a short reproduction case:

let rec foo =
  let rec f = function
    | 0 -> 100
    | n -> foo (n-1)
  and g = function
    | 0 -> 200
    | n -> f (n-1) in
  g

let () = print_int (foo 2)

Interestingly, the native-code version does not use caml_update_dummy, so the fix will not be as simple as checking for Infix_tag in that function.

The short reproduction case above is derived from a comment the function in Sail that caused the original report. It seems that someone on the Sail team ran into this bug before!

cc @chambart who's been looking at recursive value initialisation stuff.

@gasche

This comment has been minimized.

Copy link
Member

commented Jun 3, 2019

Interesting, this seems to be exactly #8682 (fix for #8681). I would not have guessed that the two issues are related.

@stedolan

This comment has been minimized.

Copy link
Contributor

commented Jun 3, 2019

So it is! I missed that PR. What does #8682 do on this example?

@gasche

This comment has been minimized.

Copy link
Member

commented Jun 3, 2019

There are two parts in #8682: an attempt to fix the compilation scheme in this case (which is not correct in the general case; that part would need further work) and a restriction to the static check to reject these dangerous cases (the idea is that all-function let rec have all their declared identifiers considered to "have dynamic size" in the static-vs-dynamic classification). As a result of the second part, this example is statically rejected, which is probably the more correct thing to do today.

@stedolan

This comment has been minimized.

Copy link
Contributor

commented Jun 4, 2019

OK, thanks for the explanation! Since this affects real code, I'd prefer to accept the example and do the right thing. I was discussing the issue with @lpw25 yesterday, and it doesn't seem like the runtime changes needed to support this use are too complicated. I'll make a PR soon.

@gasche

This comment has been minimized.

Copy link
Member

commented Jun 4, 2019

Note that it's easy to fix the real code by eta-expanding the outer function.

@gasche

This comment has been minimized.

Copy link
Member

commented Jun 5, 2019

The issue should have been fixed by #8712, which is now merged -- it should appear in the 4.09 release (not 4.08).

Note that you can already fix the Sail codebase by performing an eta-expansion in the definition of doc_nexp in pretty_print_sail.ml#L91: let rec doc_nexp nexp = let rec .... in nexp0 nexp -- and same for similar letrec-in-letrec definitions.

bacam added a commit to rems-project/sail that referenced this issue Jun 8, 2019
@nojb

This comment has been minimized.

Copy link
Contributor

commented Jun 19, 2019

@ojno it would be great if you could check if #8712 fixes the issue in your code, so that we can close this. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.