-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Keep information about allocation sizes, for statmemprof, and use during GC. #8805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4a67f9b
to
b49f09d
Compare
Thanks, @stedolan, for doing this work. This patch is particularly subtle. I will do a thorough review sometime this week. Before this, I already have a few remarks:
|
Thanks!
OK, I'll fix these two.
The call occurred just after caml_gc_dispatch, which has been replaced with a call to
I've been thinking along the same lines, but haven't come to any conclusions about which of these is the cleanest representation. (I think this should be fixed as part of the following PR that makes memprof use the new comballoc data). |
Would it be possible to have the frametable compression as a separate preliminary PR? This sounds easier to review independently of the subtle native-allocation-mode changes. |
Is that an offer to review this part? It was open for a while as #8637. This part hasn't changed much since then, so I could split it back off and reopen that PR. But I'd rather avoid the (significant!) busywork of maintaining two dependent PRs simultaneously unless it will definitely speed up review. |
I would be happy to help but I can't commit reviewing resources right now, so please do as you think is best. (Significant busywork? Meh.) |
b49f09d
to
bb9a914
Compare
I've split this PR and reopened #8637 with the frametable optimisation part. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a first review of this code and it looks good so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just made a thorough review: this looks very good, apart from a few minor comments.
use the new debug info in statmemprof to track inside comballoc'd blocks
We could do that in a different PR, don't you think? I have the feeling that this will involve not-so-trivial changes in the Statmemprof code.
either remove tags from the statmemprof API, or pass them from native blocks
I guess this will depend on your measurements of the overhead of the new frame table format. An alternative is to provide this piece of information only when the block gets promoted using the dedicated API, since we agreed we would add such a callback anyway.
FTR, the |
Thanks for the feedback! These changes sound good. I was hoping to update this PR today, but ended up being too busy. I'm afraid this means it'll be a couple of weeks before I can spend any time on this, because I'm going on holidays tomorrow. |
(@jhjourdan: if the lack of these changes is blocking you over the next couple of weeks, feel free to push to this branch. I've added you as a committer on stedolan/ocaml, so you should be able to push to this branch) |
864c945
to
a3cf326
Compare
3d9e9fa
to
2008484
Compare
Current status: now all of the architectures build and generate the new format of debug info and frametables, but since I haven't updated the allocation stubs for most architectures yet, non-amd64 still doesn't work. Since e356a2d, there's now enough testing of the new allocation-size information that I'm confident that a run of the testsuite under the debug runtime will find any problems. (In particular, the minor GC now has various assertions that check young_ptr upon entry, which detects bad allocation lengths in the frametables). Once I've gotten the other architectures fixed up, I need to test them. Does anyone know how to make inria precheck run the testsuite with a debug runtime (i.e. with USE_RUNTIME=d)? |
I don't know whether there is an intended way of doing that. You can always run it on a custom branch which sets |
2008484
to
6891973
Compare
Locations of inlined frames are now represented as contiguous sequences rather than linked lists. The frame tables now refer to debug info by 32-bit offset rather than word-sized pointer.
This code is adapted from jhjourdan's 2c93ca1. Comballoc is extended to keep track of allocation sizes and debug info for each allocation, and the frame table format is modified to store them. The native code GC-entry logic is changed to match bytecode, by calling the garbage collector at most once per allocation. amd64 only, for now.
Co-Authored-By: Damien Doligez <damien.doligez@gmail.com>
In debug builds, the minor GC now asserts that young_ptr points to a valid minor heap header before starting GC. Since very few bit patterns are valid minor heap headers, this is unlikely to be true by coincidence. This patch also ensures that minor allocations have color 0. This was inconsistent between backends before.
Moves the alloc_dbginfo type to Debuginfo, to avoid a circular dependency on architectures that use Branch_relaxation. This commit generates frame tables with allocation sizes on all architectures, but does not yet update the allocation code for non-amd64 backends.
amd64: remove caml_call_gc{1,2,3} and simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc. i386: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc. these functions do not need to preserve ebx. arm: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc. partial revert of ocaml#8619. arm64: simplify caml_alloc{1,2,3,N} by tail-calling caml_call_gc. partial revert of ocaml#8619. power: partial revert of ocaml#8619. avoid restarting allocation sequence after failure. s390: partial revert of ocaml#8619. avoid restarting allocation seqeunce after failure.
efca189
to
7fe3604
Compare
I followed @jhjourdan's suggestion, and this branch has now passed precheck with a debug runtime (precheck 314). Below are some stats on frame table / debuginfo sizes. tl;dr: frame table sizes don't change much, debuginfo sizes vary a bit more widely but on average get smaller (the better format usually outweighs the greater amount). Here are some stats for ocamlopt.opt:
These sizes are affected by the optimisations in #8637 (which are included in this PR) and the fact that this PR adds more information to the tables. Frametables for allocation points have more information (now including allocation size information and debuginfo), but the debuginfo pointers are now half the size. Much more debuginfo is generated, but the more compact format of #8637 means each debuginfo entry takes less space. The combination of these effects means very little overall change in frametable sizes. The frametable and debuginfo sizes of the 263 modules linked into ocamlopt.opt are plotted below (raw data here). The plot uses a log scale, and each tick in the x-axis means 5% larger than the previous tick. The box is the interquartile range (25th to 75th percentiles) and the line is the median. There are 4 outliers (Ast_helper, Convert_primitives, Stdlib__option, X86_dsl) not shown in the debuginfo part of this graph as their debuginfo grew by more than 2.5x. These add up to 1.3kB of debuginfo in trunk vs 5.4kb in this PR, which is a large change but still a small absolute number. I suspect these are modules with many allocations but few calls, so the amount of new debuginfo dominates. There are no modules with such outlying frame table size changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a review of the architecture dependent code, which seems good to me. I added a few minor comments, which are mostly minor optimization suggestions.
Note, however, that I know almost nothing about architectures other than amd64/i386, so I cannot guarantee anything.
let tmp = if i.res.(0).loc = Reg 8 (* r12 *) then phys_reg 7 (* r7 *) | ||
else phys_reg 8 (* r12 *) | ||
in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this change mean that we can now declarer r12 as being preserved by allocations in proc.ml
?
This is also the case of r7 if fast_code_flag
is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is safe. While the OCaml stubs no longer use r12
, it's not preserved across procedure calls as the linker may insert stubs that clobber it. See "Use of IP by the linker" on page 22 of "Procedure Call Standard for the Arm Architecture" and PR #1304, which fixed a bug along these lines for amd64 (where %r10, %r11 might be clobbered by the linker)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. That's particularly subtle.
/* Record lowest stack address and return address */ | ||
pushl %ebx; CFI_ADJUST(4) | ||
movl G(Caml_state), %ebx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't %ebx
already contain Caml_state
at that point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right!
I've updated the other occurrence below (and done the same in i386nt.asm), but I'd prefer to leave this one as is. Removing it will make no perf difference (it's a cheap instruction on a codepath that does a lot of work), but makes the calling convention of caml_call_gc weirder.
runtime/i386.S
Outdated
@@ -150,108 +147,59 @@ LBL(105): | |||
popl %esi; CFI_ADJUST(-4) | |||
popl %edi; CFI_ADJUST(-4) | |||
popl %ebp; CFI_ADJUST(-4) | |||
/* Return to caller */ | |||
/* Return to caller. Returns young_ptr in %eax. */ | |||
movl G(Caml_state), %eax |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to do this: %ebx
already contains Caml_state
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
@@ -39,22 +39,19 @@ INCLUDE domain_state32.inc | |||
|
|||
_caml_call_gc: | |||
; Record lowest stack address and return address | |||
push ebx ; make a tmp reg | |||
mov ebx, _Caml_state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above
runtime/i386nt.asm
Outdated
@@ -66,88 +63,50 @@ L105: push ebp | |||
pop esi | |||
pop edi | |||
pop ebp | |||
; Return to caller | |||
; Return to caller. Returns young_ptr in eax | |||
mov eax, _Caml_state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
@xavierleroy: any chance you could have a look at the power and s390 backend changes in 7fe3604? |
I had a quick look at the PowerPC and s390 changes. To summarize:
I think this is what you wanted to do, and I think you've achieved it. |
Great! I think that means that every part of this PR has been reviewed by someone, so I think this is ready to merge. |
Keep information about allocation sizes, for statmemprof, and use during GC. (cherry picked from commit 92bfafc)
…mmit 92bfafc) Keep information about allocation sizes, for statmemprof, and use during GC.
…mmit 92bfafc) Keep information about allocation sizes, for statmemprof, and use during GC.
Analysis and proposed fix here: #9461 (comment) |
It would be good to have a test: file |
Fix in #9465 . I added the test as suggested but without |
This patch makes some subtle changes to how allocations work that are needed for statmemprof, as recently discussed with @damiendoligez and @jhjourdan. The main change is to make information about allocation sizes available to the GC, and to make native-code allocations work like bytecode allocations. (Bytecode allocations are simpler, but require size information that was not previously available).
In bytecode mode, small allocations on the minor heap are straightforward, following roughly this pattern:
The function
caml_alloc_small_dispatch
undoes the failed allocation, then does a minor GC and runs any pending signal handlers / finalisers, and retries the allocation. (It may have to do this several times, as it's possible for a signal handler / finaliser to re-fill the minor heap). Finally, it logs the now-successful allocation with statmemprof, if needed.Note that this logic requires knowing how big the allocation is.
In native code, allocations are more complicated. They look like this:
The function
caml_garbage_collection
does not know how big the allocation is. Instead, it ensures that there is enough space on the minor heap for the largest possible allocation (256 words). Not passing the allocation size translates to a small saving in code size. (The line[*]
was added by #8619: without it, failed allocations take twice the space they should, throwing off GC stats.)For statmemprof to work, it needs to know how big the allocations are. To avoid affecting code size, this patch extends the frame table with a compact representation of allocation size, so that
caml_garbage_collection
can determine the allocation size without needing to be passed it explicitly.However, simply encoding the allocation size information is not quite enough. The problem is a subtle one: the allocation that we sample during
caml_garbage_collection
may not be the next one to be allocated. The issue is theredo
label: it comes before the young_limit check, so after caml_garbage_collection returns there's a new opportunity to run signal handlers and finalisers. This means that the following sequence of events is possible:The data maintained by statmemprof becomes corrupted, because the next allocation is not the one that statmemprof was expecting to happen. The problem is due to the extra polling of signals that happens between
caml_garbage_collection
and the actual allocation, and the fix is to make allocations work the same way they do on bytecode:This way, once caml_garbage_collection returns the space has definitely been allocated, and natively-compiled code need not redo the check. Internally, this calls the same
caml_alloc_small_dispatch
function used for bytecode allocations.The code used to record allocation information is lifted from 2c93ca1. (@jhjourdan: since I pretty much copy-pasted your approach, I've added you as an author to the Changes file of this PR). The tricky bit here is keeping track of allocation sizes through
Comballoc
, which may combine several allocations into one.This requires storing a lot more information in the frame tables, and generating more debug info. To keep this small, I've adapted the compact debuginfo format from #8637, which this PR supersedes.
There are a few bits left to do here:
Left for a later PR: