Keep information about allocation sizes, for statmemprof, and use during GC. #8805

Merged: 8 commits, Nov 6, 2019

@stedolan stedolan commented Jul 16, 2019

This patch makes some subtle changes to how allocations work that are needed for statmemprof, as recently discussed with @damiendoligez and @jhjourdan. The main change is to make information about allocation sizes available to the GC, and to make native-code allocations work like bytecode allocations. (Bytecode allocations are simpler, but require size information that was not previously available).


In bytecode mode, small allocations on the minor heap are straightforward, following roughly this pattern:

caml_young_ptr -= Whsize_wosize(wosize);
if (caml_young_ptr < caml_young_limit) {
  caml_alloc_small_dispatch(tag, wosize, CAML_DO_TRACK | CAML_FROM_CAML);
}
Hd_hp(caml_young_ptr) = header;

The function caml_alloc_small_dispatch undoes the failed allocation, then does a minor GC and runs any pending signal handlers / finalisers, and retries the allocation. (It may have to do this several times, as it's possible for a signal handler / finaliser to re-fill the minor heap). Finally, it logs the now-successful allocation with statmemprof, if needed.

Note that this logic requires knowing how big the allocation is.
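To make the bytecode pattern concrete, here is a minimal toy model in C. It is a sketch, not the runtime's code: the heap is just an index range, and every name (toy_alloc_small, toy_alloc_small_dispatch, pending_handler_words) is hypothetical. It shows why the slow path needs the size: it has to undo and then redo the bump, possibly more than once if a handler refills the heap.

```c
#include <assert.h>

enum { HEAP_WORDS = 8 };

static long young_ptr = HEAP_WORDS;     /* allocation pointer, moves downwards */
static long young_limit = 0;            /* allocation fails below this index */
static int minor_gcs = 0;               /* number of simulated minor collections */
static long pending_handler_words = 0;  /* words a simulated handler will allocate */

/* One header word plus wosize fields, in the spirit of Whsize_wosize. */
static long whsize_of_wosize(long wosize) { return wosize + 1; }

/* Slow path: undo the failed bump, collect, let handlers run, then retry
   until the block fits. It cannot do any of this without knowing wosize. */
static void toy_alloc_small_dispatch(long wosize) {
  long whsize = whsize_of_wosize(wosize);
  assert(whsize <= HEAP_WORDS - young_limit);  /* guarantees termination */
  young_ptr += whsize;                         /* undo the failed allocation */
  for (;;) {
    young_ptr = HEAP_WORDS;                    /* simulated minor GC */
    minor_gcs++;
    young_ptr -= pending_handler_words;        /* simulated handler allocating */
    pending_handler_words = 0;
    if (young_ptr - whsize >= young_limit) {
      young_ptr -= whsize;                     /* the retry succeeded */
      return;
    }
  }
}

/* Fast path: bump, compare, and fall back to the dispatch function. */
static long toy_alloc_small(long wosize) {
  young_ptr -= whsize_of_wosize(wosize);
  if (young_ptr < young_limit) toy_alloc_small_dispatch(wosize);
  return young_ptr;  /* index of the new block's header word */
}
```

Setting pending_handler_words to a large value before an allocation that overflows the heap forces a second pass through the loop, which mirrors the "may have to do this several times" remark.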

In native code, allocations are more complicated. They look like this:

redo:
caml_young_ptr -= Whsize_wosize(wosize);
if (caml_young_ptr < caml_young_limit) goto gc;
Hd_hp(caml_young_ptr) = header;
...

gc:
caml_young_ptr += Whsize_wosize(wosize); // [*]
caml_garbage_collection();
goto redo;

The function caml_garbage_collection does not know how big the allocation is. Instead, it ensures that there is enough space on the minor heap for the largest possible allocation (256 words). Not passing the allocation size translates to a small saving in code size. (The line [*] was added by #8619: without it, failed allocations take twice the space they should, throwing off GC stats.)

For statmemprof to work, it needs to know how big the allocations are. To avoid affecting code size, this patch extends the frame table with a compact representation of allocation size, so that caml_garbage_collection can determine the allocation size without needing to be passed it explicitly.
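As an illustration of what a compact size representation could look like, the sketch below stores each allocation size in one byte. This is an assumption made for the example, not a description of the patch's exact format: minor-heap blocks have between 1 and 256 fields, so wosize - 1 fits in a uint8_t. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_YOUNG_WOSIZE = 256 };  /* largest minor-heap block, in fields */

/* Hypothetical one-byte encoding: sizes 1..256 are stored as 0..255. */
static uint8_t encode_alloc_len(size_t wosize) {
  assert(wosize >= 1 && wosize <= MAX_YOUNG_WOSIZE);
  return (uint8_t)(wosize - 1);
}

/* Inverse of encode_alloc_len. */
static size_t decode_alloc_len(uint8_t encoded) {
  return (size_t)encoded + 1;
}
```

With an encoding along these lines, the GC entry point can look up the frame descriptor for its return address and decode the size, so the compiled allocation sequence does not need an extra instruction to pass it.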

However, simply encoding the allocation size information is not quite enough. The problem is a subtle one: the allocation that we sample during caml_garbage_collection may not be the next one to be allocated. The issue is the redo label: it comes before the young_limit check, so after caml_garbage_collection returns there's a new opportunity to run signal handlers and finalisers. This means that the following sequence of events is possible:

caml_young_ptr -= 3; // try to allocate a 2-word block, e.g. cons cell
if (caml_young_ptr < caml_young_limit) goto gc; // allocation fails
gc:
caml_young_ptr += 3; // undo failed allocation
caml_garbage_collection(); // statmemprof samples about-to-be-allocated cell
goto redo;
*** SIGINT received, caml_young_limit updated ***
caml_young_ptr -= 3; // retry allocation of cons cell
if (caml_young_ptr < caml_young_limit) goto gc; // allocation fails due to signal
gc:
caml_young_ptr += 3; // undo failed allocation
caml_garbage_collection(); // runs signal handler
// signal handler allocates, say, a 1-word ref.

The data maintained by statmemprof becomes corrupted, because the next allocation is not the one that statmemprof was expecting to happen. The problem is due to the extra polling of signals that happens between caml_garbage_collection and the actual allocation, and the fix is to make allocations work the same way they do on bytecode:

caml_young_ptr -= Whsize_wosize(wosize);
if (caml_young_ptr < caml_young_limit) goto gc;
gc_done:
Hd_hp(caml_young_ptr) = header;

gc:
caml_garbage_collection();
goto gc_done;

This way, once caml_garbage_collection returns, the space has definitely been allocated, and natively-compiled code need not redo the check. Internally, this calls the same caml_alloc_small_dispatch function used for bytecode allocations.
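The new native-code protocol can be mimicked with a toy model in C. This is only a sketch under simplifying assumptions: allocation sites are identified by a small integer instead of a return address, the "frame table" is a plain array, and all names are hypothetical. The point it illustrates is that the GC entry point recovers the size itself and returns with the block already allocated, so the caller goes straight to writing the header.

```c
#include <assert.h>

enum { HEAP_WORDS = 8, NUM_SITES = 2 };

/* Hypothetical stand-in for the frame table: words requested per site. */
static const long site_whsize[NUM_SITES] = { 3, 5 };

static long heap[HEAP_WORDS];
static long young_ptr = HEAP_WORDS;  /* allocation pointer, moves downwards */
static const long young_limit = 0;
static int gc_calls = 0;

/* Takes no size argument: the size is looked up from the site's table entry. */
static void toy_garbage_collection(int site) {
  long whsize = site_whsize[site];
  young_ptr = HEAP_WORDS;            /* simulated minor GC; discards the failed bump */
  gc_calls++;
  young_ptr -= whsize;               /* allocate on the caller's behalf */
  assert(young_ptr >= young_limit);  /* space is guaranteed on return */
}

static long toy_native_alloc(int site, long header) {
  young_ptr -= site_whsize[site];    /* a compile-time constant in real code */
  if (young_ptr < young_limit) toy_garbage_collection(site);
  /* gc_done: no second limit check, so no extra polling point */
  heap[young_ptr] = header;
  return young_ptr;
}
```

Because there is no jump back to a redo label, nothing can run between the GC call and the header write, which is the property statmemprof relies on.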


The code used to record allocation information is lifted from 2c93ca1. (@jhjourdan: since I pretty much copy-pasted your approach, I've added you as an author to the Changes file of this PR). The tricky bit here is keeping track of allocation sizes through Comballoc, which may combine several allocations into one.
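To see why Comballoc makes the bookkeeping tricky, consider the following sketch in C. It assumes, purely for illustration, that the combined blocks are laid out one after another in the order given; the function name and the layout order are hypothetical. A single pointer bump covers the total, but a profiler needs each block's own size and position.

```c
#include <assert.h>
#include <stddef.h>

/* Given the field counts of n blocks merged into one pointer bump, store the
   word offset of each block's header in offsets[] and return the total
   number of words (headers included) that the single bump must reserve. */
static size_t comballoc_layout(const size_t *wosizes, size_t n, size_t *offsets) {
  size_t total = 0;
  for (size_t i = 0; i < n; i++) {
    offsets[i] = total;        /* header of block i starts here */
    total += wosizes[i] + 1;   /* one header word plus the fields */
  }
  return total;
}
```

For blocks of 2, 1 and 3 fields this reserves 9 words with headers at offsets 0, 3 and 5. The bump alone only reveals the 9, which is why the individual sizes have to be recorded separately.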

This requires storing a lot more information in the frame tables, and generating more debug info. To keep this small, I've adapted the compact debuginfo format from #8637, which this PR supersedes.


There are a few bits left to do here:

Left for a later PR:

  • update the statmemprof tests to be aware of native-code allocations
  • use the new debug info in statmemprof to track inside comballoc'd blocks
  • either remove tags from the statmemprof API, or pass them from native blocks

@jhjourdan
Contributor

Thanks, @stedolan, for doing this work. This patch is particularly subtle.

I will do a thorough review sometime this week. Before this, I already have a few remarks:

  • Please add a FIXME somewhere so that we remember that Comballoc is not yet supported in the memprof runtime code. Currently, only the first block of a combined allocation is sampled, at a rate which is proportional to the full combined allocation. This is clearly not the right thing to do (but OK for this patch).
  • In the tests, could you grep for "FIXME: we should use 1", and use 1 instead of 0.5 in the corresponding tests? This patch should fix this issue.
  • A call to caml_spacetime_automatic_snapshot(); has been dropped in caml_garbage_collection. I am not sure how spacetime uses this, but dropping the call is probably not the right thing to do.
  • We should anticipate the representation of a raw backtrace slot corresponding to an allocation which is not the first of a combined block. These allocations correspond to the same entry in the frame table, but to different debug info. Hence, the current approach of using a pointer to the frame table as a backtrace_slot value no longer works. One possible approach would be to use a debuginfo pointer instead of a frame table pointer. But then, we can no longer capture the call stack when there is no debug info. Also, capturing the call stack will require more decoding of the frame table, which may have some performance impact (?). Another possibility would be to have a low-order bit of a backtrace_slot determine whether this is a frame table pointer or a debuginfo pointer....
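The low-order-bit idea from the last point can be sketched in C as follows. This is a hypothetical illustration, not the representation the runtime ended up with: it assumes both kinds of pointer are at least 2-byte aligned, so bit 0 is free to act as a tag. The type and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical slot type: a pointer with bit 0 used as a discriminator. */
typedef uintptr_t toy_backtrace_slot;

/* Tag 0: the slot points at a frame descriptor. */
static toy_backtrace_slot slot_of_frame_descr(const void *fd) {
  uintptr_t p = (uintptr_t)fd;
  assert((p & 1) == 0);  /* relies on alignment leaving bit 0 clear */
  return p;
}

/* Tag 1: the slot points directly at a debuginfo entry. */
static toy_backtrace_slot slot_of_debuginfo(const void *dbg) {
  uintptr_t p = (uintptr_t)dbg;
  assert((p & 1) == 0);
  return p | 1;
}

static int slot_is_debuginfo(toy_backtrace_slot s) { return (int)(s & 1); }

/* Strip the tag to recover the original pointer. */
static const void *slot_pointer(toy_backtrace_slot s) {
  return (const void *)(s & ~(uintptr_t)1);
}
```

The trade-off is the one described above: untagged slots stay cheap to capture, while tagged slots can name a specific allocation inside a combined block.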

@stedolan
Contributor Author

Thanks!

  • Please add a FIXME somewhere so that we remember that Comballoc is not yet supported in the memprof runtime code. Currently, only the first block of a combined allocation is sampled, at a rate which is proportional to the full combined allocation. This is clearly not the right thing to do (but OK for this patch).
  • In the tests, could you grep for "FIXME: we should use 1", and use 1 instead of 0.5 in the corresponding tests? This patch should fix this issue.

OK, I'll fix these two.

  • A call to caml_spacetime_automatic_snapshot(); has been dropped in caml_garbage_collection. I am not sure how spacetime uses this, but dropping the call is probably not the right thing to do.

The call occurred just after caml_gc_dispatch, which has been replaced with a call to caml_alloc_small_dispatch. So, I moved it to caml_alloc_small_dispatch, guarded by NATIVE_CODE and WITH_SPACETIME.

  • We should anticipate the representation of a raw backtrace slot corresponding to an allocation which is not the first of a combined block. These allocations correspond to the same entry in the frame table, but to different debug info. Hence, the current approach of using a pointer to the frame table as a backtrace_slot value no longer works. One possible approach would be to use a debuginfo pointer instead of a frame table pointer. But then, we can no longer capture the call stack when there is no debug info. Also, capturing the call stack will require more decoding of the frame table, which may have some performance impact (?). Another possibility would be to have a low-order bit of a backtrace_slot determine whether this is a frame table pointer or a debuginfo pointer....

I've been thinking along the same lines, but haven't come to any conclusions about which of these is the cleanest representation. (I think this should be fixed as part of the following PR that makes memprof use the new comballoc data).

@gasche
Member

gasche commented Jul 16, 2019

Would it be possible to have the frametable compression as a separate preliminary PR? This sounds easier to review independently of the subtle native-allocation-mode changes.

@stedolan
Contributor Author

Would it be possible to have the frametable compression as a separate preliminary PR? This sounds easier to review independently of the subtle native-allocation-mode changes.

Is that an offer to review this part?

It was open for a while as #8637. This part hasn't changed much since then, so I could split it back off and reopen that PR. But I'd rather avoid the (significant!) busywork of maintaining two dependent PRs simultaneously unless it will definitely speed up review.

@gasche
Member

gasche commented Jul 16, 2019

I would be happy to help but I can't commit reviewing resources right now, so please do as you think is best. (Significant busywork? Meh.)

@stedolan stedolan force-pushed the statmemprof-comballoc-native branch from b49f09d to bb9a914 Compare July 16, 2019 15:24
@stedolan
Contributor Author

I've split this PR and reopened #8637 with the frametable optimisation part.

Member

@damiendoligez damiendoligez left a comment

I've done a first review of this code and it looks good so far.

Contributor

@jhjourdan jhjourdan left a comment

I've just made a thorough review: this looks very good, apart from a few minor comments.

use the new debug info in statmemprof to track inside comballoc'd blocks

We could do that in a different PR, don't you think? I have the feeling that this will involve not-so-trivial changes in the Statmemprof code.

either remove tags from the statmemprof API, or pass them from native blocks

I guess this will depend on your measurements of the overhead of the new frame table format. An alternative is to provide this piece of information only when the block gets promoted using the dedicated API, since we agreed we would add such a callback anyway.

@jhjourdan
Contributor

FTR, the arrays_in_minor test is failing because of the wrong tag 0 which is passed to memprof by the native runtime. I think for now the simplest fix is to disable statmemprof for blocks allocated by native code (it's anyway temporarily bogus in many ways). Just don't pass the CAML_DO_TRACK flag when calling caml_alloc_small_dispatch in caml_garbage_collection.

@stedolan
Contributor Author

Thanks for the feedback! These changes sound good.

I was hoping to update this PR today, but ended up being too busy. I'm afraid this means it'll be a couple of weeks before I can spend any time on this, because I'm going on holidays tomorrow.

@stedolan
Contributor Author

(@jhjourdan: if the lack of these changes is blocking you over the next couple of weeks, feel free to push to this branch. I've added you as a committer on stedolan/ocaml, so you should be able to push to this branch)

@stedolan stedolan force-pushed the statmemprof-comballoc-native branch from 864c945 to a3cf326 Compare September 17, 2019 15:46
This was referenced Oct 9, 2019
@stedolan stedolan force-pushed the statmemprof-comballoc-native branch 2 times, most recently from 3d9e9fa to 2008484 Compare October 16, 2019 13:46
@stedolan
Contributor Author

Current status: now all of the architectures build and generate the new format of debug info and frametables, but since I haven't updated the allocation stubs for most architectures yet, non-amd64 still doesn't work.

Since e356a2d, there's now enough testing of the new allocation-size information that I'm confident that a run of the testsuite under the debug runtime will find any problems. (In particular, the minor GC now has various assertions that check young_ptr upon entry, which detects bad allocation lengths in the frametables).
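A check of the kind described here might look like the sketch below. It is an assumption-laden illustration rather than the assertion added in e356a2d: it uses the OCaml 4.x header layout without profiling info (size in the bits above bit 10, colour in bits 8 and 9, tag in the low byte), and the names toy_header, make_header and is_plausible_young_header are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t toy_header;

enum { MAX_YOUNG_WOSIZE = 256 };  /* largest block allowed on the minor heap */

/* Pack a header word: size | colour | tag. */
static toy_header make_header(uintptr_t wosize, unsigned color, unsigned tag) {
  return (wosize << 10) | ((uintptr_t)(color & 3) << 8) | (uintptr_t)(tag & 0xff);
}

/* A word is a plausible minor-heap header only if its size is in 1..256
   and its colour bits are 0. Few bit patterns satisfy both conditions. */
static int is_plausible_young_header(toy_header hd) {
  uintptr_t wosize = hd >> 10;
  unsigned color = (unsigned)((hd >> 8) & 3);
  return wosize >= 1 && wosize <= MAX_YOUNG_WOSIZE && color == 0;
}
```

If the frame table records a wrong allocation length, young_ptr ends up pointing into the middle of a block or at unwritten memory, and a test like this is very likely to reject whatever word it finds there.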

Once I've gotten the other architectures fixed up, I need to test them. Does anyone know how to make inria precheck run the testsuite with a debug runtime (i.e. with USE_RUNTIME=d)?

@jhjourdan
Contributor

Does anyone know how to make inria precheck run the testsuite with a debug runtime (i.e. with USE_RUNTIME=d)?

I don't know whether there is an intended way of doing that. You can always run it on a custom branch which sets USE_RUNTIME=d manually in the makefile.

@stedolan stedolan force-pushed the statmemprof-comballoc-native branch from 2008484 to 6891973 Compare October 22, 2019 08:43
stedolan and others added 5 commits October 22, 2019 11:46
Locations of inlined frames are now represented as contiguous
sequences rather than linked lists.

The frame tables now refer to debug info by 32-bit offset rather
than word-sized pointer.
This code is adapted from jhjourdan's 2c93ca1. Comballoc is
extended to keep track of allocation sizes and debug info for each
allocation, and the frame table format is modified to store them.

The native code GC-entry logic is changed to match bytecode, by
calling the garbage collector at most once per allocation.

amd64 only, for now.
Co-Authored-By: Damien Doligez <damien.doligez@gmail.com>
In debug builds, the minor GC now asserts that young_ptr points to
a valid minor heap header before starting GC. Since very few bit
patterns are valid minor heap headers, this is unlikely to be true
by coincidence.

This patch also ensures that minor allocations have color 0. This
was inconsistent between backends before.
Moves the alloc_dbginfo type to Debuginfo, to avoid a circular
dependency on architectures that use Branch_relaxation.

This commit generates frame tables with allocation sizes on all
architectures, but does not yet update the allocation code for
non-amd64 backends.
amd64: remove caml_call_gc{1,2,3} and simplify caml_alloc{1,2,3,N}
       by tail-calling caml_call_gc.

i386:  simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
       these functions do not need to preserve ebx.

arm:   simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
       partial revert of ocaml#8619.

arm64: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc.
       partial revert of ocaml#8619.

power: partial revert of ocaml#8619.
       avoid restarting allocation sequence after failure.

s390:  partial revert of ocaml#8619.
       avoid restarting allocation sequence after failure.
@stedolan stedolan force-pushed the statmemprof-comballoc-native branch from efca189 to 7fe3604 Compare October 23, 2019 08:24
@stedolan
Contributor Author

I followed @jhjourdan's suggestion, and this branch has now passed precheck with a debug runtime (precheck 314). Below are some stats on frame table / debuginfo sizes. tl;dr: frame table sizes don't change much, debuginfo sizes vary a bit more widely but on average get smaller (the better format usually outweighs the greater amount).

Here are some stats for ocamlopt.opt:

           binary size   frametable   debuginfo
trunk      11717 kB      1311 kB      578 kB
this PR    11651 kB      1341 kB      513 kB
change     -0.5%         +2%          -11%

These sizes are affected by the optimisations in #8637 (which are included in this PR) and the fact that this PR adds more information to the tables. Frametables for allocation points have more information (now including allocation size information and debuginfo), but the debuginfo pointers are now half the size. Much more debuginfo is generated, but the more compact format of #8637 means each debuginfo entry takes less space.
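The "half the size" remark applies to 64-bit targets, where a word-sized pointer takes 8 bytes and a 32-bit offset takes 4. The C sketch below shows one way such an offset can be resolved. It assumes, for illustration only, that the offset is measured in bytes from the address of the offset field itself; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table fragment: a 32-bit reference followed by some entries. */
struct toy_table {
  int32_t debuginfo_offset;  /* byte distance from this field to the entry */
  uint32_t debuginfo[2];
};

/* Store a reference to target as an offset relative to the field's address. */
static void store_debuginfo_ref(int32_t *field, const void *target) {
  *field = (int32_t)((const char *)target - (const char *)field);
}

/* Recover the absolute address from the self-relative offset. */
static const void *resolve_debuginfo_ref(const int32_t *field) {
  return (const char *)field + *field;
}
```

A self-relative offset of this kind needs no load-time relocation, provided the table and the debuginfo it refers to lie within 2 GB of each other.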

The combination of these effects means very little overall change in frametable sizes. The frametable and debuginfo sizes of the 263 modules linked into ocamlopt.opt are plotted below (raw data here). The plot uses a log scale, and each tick in the x-axis means 5% larger than the previous tick. The box is the interquartile range (25th to 75th percentiles) and the line is the median.

[Figure: frametables]

There are 4 outliers (Ast_helper, Convert_primitives, Stdlib__option, X86_dsl) not shown in the debuginfo part of this graph as their debuginfo grew by more than 2.5x. These add up to 1.3 kB of debuginfo in trunk vs 5.4 kB in this PR, which is a large change but still a small absolute number. I suspect these are modules with many allocations but few calls, so the amount of new debuginfo dominates. There are no modules with such outlying frame table size changes.

Contributor

@jhjourdan jhjourdan left a comment

I did a review of the architecture-dependent code, which seems good to me. I added a few minor comments, which are mostly small optimization suggestions.

Note, however, that I know almost nothing about architectures other than amd64/i386, so I cannot guarantee anything.

Comment on lines -660 to -662
let tmp = if i.res.(0).loc = Reg 8 (* r12 *) then phys_reg 7 (* r7 *)
else phys_reg 8 (* r12 *)
in
Contributor

Doesn't this change mean that we can now declare r12 as being preserved by allocations in proc.ml?

This is also the case for r7 if fast_code_flag is set.

Contributor Author

I don't think this is safe. While the OCaml stubs no longer use r12, it's not preserved across procedure calls, as the linker may insert stubs that clobber it. See "Use of IP by the linker" on page 22 of "Procedure Call Standard for the Arm Architecture" and PR #1304, which fixed a bug along these lines for amd64 (where %r10 and %r11 might be clobbered by the linker).

Contributor

Ah, right. That's particularly subtle.

/* Record lowest stack address and return address */
pushl %ebx; CFI_ADJUST(4)
movl G(Caml_state), %ebx
Contributor

Doesn't %ebx already contain Caml_state at that point?

Contributor Author

I think you're right!

I've updated the other occurrence below (and done the same in i386nt.asm), but I'd prefer to leave this one as is. Removing it will make no perf difference (it's a cheap instruction on a codepath that does a lot of work), but makes the calling convention of caml_call_gc weirder.

runtime/i386.S Outdated
@@ -150,108 +147,59 @@ LBL(105):
popl %esi; CFI_ADJUST(-4)
popl %edi; CFI_ADJUST(-4)
popl %ebp; CFI_ADJUST(-4)
- /* Return to caller */
+ /* Return to caller. Returns young_ptr in %eax. */
+ movl G(Caml_state), %eax
Contributor

No need to do this: %ebx already contains Caml_state here.

Contributor Author

Changed

@@ -39,22 +39,19 @@ INCLUDE domain_state32.inc

_caml_call_gc:
; Record lowest stack address and return address
push ebx ; make a tmp reg
mov ebx, _Caml_state
Contributor

Idem.

Contributor Author

See above

@@ -66,88 +63,50 @@ L105: push ebp
pop esi
pop edi
pop ebp
- ; Return to caller
+ ; Return to caller. Returns young_ptr in eax
+ mov eax, _Caml_state
Contributor

Idem.

Contributor Author

Changed

@stedolan
Contributor Author

stedolan commented Nov 4, 2019

@xavierleroy: any chance you could have a look at the power and s390 backend changes in 7fe3604?

@xavierleroy
Contributor

I had a quick look at the PowerPC and s390 changes. To summarize:

  • (Revert) When caml_call_gc is entered, the allocation pointer has been decremented by the allocation size already.
  • (New) The allocation sequence is not restarted, so we rely on caml_call_gc to return with the allocation pointer pointing to a freshly-allocated block of the right size.

I think this is what you wanted to do, and I think you've achieved it.

@stedolan
Contributor Author

stedolan commented Nov 4, 2019

Great! I think that means that every part of this PR has been reviewed by someone, so I think this is ready to merge.

@gasche gasche merged commit 92bfafc into ocaml:trunk Nov 6, 2019
@jhjourdan
Contributor

Thanks, @gasche and @stedolan!

stedolan pushed a commit to janestreet/ocaml that referenced this pull request Mar 6, 2020
Keep information about allocation sizes, for statmemprof, and use during GC.

(cherry picked from commit 92bfafc)
stedolan pushed a commit to janestreet/ocaml that referenced this pull request Mar 17, 2020
…mmit 92bfafc)

Keep information about allocation sizes, for statmemprof, and use during GC.
mshinwell pushed a commit to mshinwell/ocaml that referenced this pull request Apr 7, 2020
…mmit 92bfafc)

Keep information about allocation sizes, for statmemprof, and use during GC.
@gasche
Member

gasche commented Apr 18, 2020

Cross-linking: @nojb identified a regression in trunk due to the use of Compilenv.make_symbol in this PR, see #9461.

@xavierleroy
Contributor

Analysis and proposed fix here: #9461 (comment)

@xavierleroy
Contributor

It would be good to have a test: file 0-!@#$%^.ml can be compiled but produces a warning.

@nojb
Contributor

nojb commented Apr 18, 2020

It would be good to have a test: file 0-!@#$%^.ml can be compiled but produces a warning.

Fix in #9465. I added the test as suggested, but without $, as that character has a special meaning in ocamltest.
