Speed up GC by prefetching during marking #10195
Conversation
This sounds extremely promising! Thank you! I'm not the most competent reviewer for this kind of code. Nonetheless I had a look at the diff, and I was put off by lots of commented-out code, and unguarded uses of
Hmm, there was more of that than I remembered! I've deleted the commented-out code and removed the cause of the CI failures (*), so this should be ready to review now. I ran into #10203 when trying to benchmark, though, so you'll have to wait a bit longer for sandmark numbers. (*) the issue was an assertion that all zero-length blocks are atoms, which isn't actually true because of how ocamlopt statically allocates empty arrays.
When you say that with naked pointers you expect performance to be much worse, I understand that you might mean that you just have not spent time on optimising it yet, but for what it is worth here is a comment about that case.
}
caml_prefetch(Hp_val(v));
caml_prefetch(&Field(v, Queue_prefetch_distance - 1));
pb[(pb_enqueued++) & Pb_mask] = v;
For naked pointers, what about:
- Prefetching and queuing the naked pointer here (replacing Is_major_block with Is_block_and_not_young, which is cool bit hacking by the way)
- Additionally prefetching the page table entry
- Checking Is_in_heap when popping the block off the queue at the start of the loop
The page table will be removed soon. Please don't make this code more complex because of it.
runtime/major_gc.c
Outdated
uintnat min_pb = Pb_min;
struct mark_stack stk = *Caml_state->mark_stack;

uintnat young_start = (uintnat)Caml_state->young_start;
Should this be + 1? The definition of Is_young gives Is_young(caml_young_start) == 0, and although unlikely, caml_young_start - 1 is allowed to contain the header of a 0-sized block with a black gc tag.
I think you're right! Nice catch.
Should be + sizeof(header_t) or, if you want, + Whsize_wosize (0).
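Concretely, the suggested fix is a one-line change along these lines (a sketch of the suggestion above, not necessarily the exact committed diff):

/* Sketch: shift the base up by one word, so a 0-sized block whose header
   sits just below the minor heap is not misclassified as young. */
uintnat young_start = (uintnat)Caml_state->young_start + sizeof(header_t);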
I mean both that I have not spent time on optimising it, and that I think performance won't be great even with time spent optimising it. The code here often exhausts the available memory parallelism (on the Skylake core I was testing on), so doing more memory accesses is going to take more time, regardless of how well they're prefetched. Having said that, I think the prefetching strategy you describe is the right one (at least for 64-bit targets, 32-bit ones have a different pagetable).
I meant for both page tables: for the 32-bit page table, let us hope that the first level is in cache already (the size of the first level is 512 words and it might not be randomly distributed), and only prefetch the second level. I am guessing that this would be good enough. Do you have another strategy in mind?

About exhausting memory parallelism (please correct me if I say anything false, I do not consider myself an expert), I am wondering whether you are already doing something against TLB misses. They can be a bottleneck for memory parallelism, and OCaml programs with default settings are susceptible to them since the runtime will not use huge pages. I tested this on your program without the prefetching optimisation: with default settings (in particular transparent hugepages set to enabled=madvise) I get 1.75% of dTLB load misses, against 0.64% with THP enabled=always, and 0.01% with

For reference, THP enabled=always gives a speed-up of about 3% on memory-intensive Coq benchmarks.
Prefetching helps with TLB performance as well. Prefetching allows the processor to handle more memory requests in parallel, which include page walks if necessary. As you say, hugepages will improve performance due to better TLB utilisation, but I think that's independent of the changes here. Below are some numbers from my laptop, where it seems that hugepages have a smaller effect on the prefetching code than on trunk. (This could be explained by the prefetching code managing to overlap more TLB misses than trunk does) Times are noisy, so only 2 sig figs. Some of the trunk times have even more variance than that. trunk:
prefetching:
Interesting! Do you have a reference? (To be specific, the detail I'm most interested in is the number of L1 line fill buffers, as these tend to be the limiting factor on random accesses from a single core. More memory parallelism elsewhere in the system doesn't help GC much, as it can be difficult to saturate using non-sequential accesses from a single core.)
What your figures show corresponds indeed to what I understood with "exhausts the available memory parallelism". I understood it to mean: there is just enough memory parallelism without the page table, but by chance there is nothing spared for prefetching the page table. According to this, even with the strategy I proposed, we should expect it to be noticeably slower with the current page table than without.

But given that you did not seem to do anything special for the TLB misses, I expect the combination prefetching+page table+THP to be anywhere between no gain compared to without THP (the TLB was not a bottleneck) and the same performance as without the page table (the TLB was the bottleneck, and prefetching removes the cost of the page table). To test the claim that performance will not be great with the page table, we need to make sure to run numbers with huge pages (which is just a question of using jemalloc with

Another implicit question was whether with more memory parallelism you could tweak the parameters to make it go faster, in no-naked-pointers mode, but your answer implicitly seems to say no.
As a non-specialist, I enjoyed Lemire's blog, see for instance 1,2,3 for the claims in question. You can easily find other experiments like this showing increased MLP at L1 level for M1 and Zen2. Regarding your suggestion of adding
Just to make sure we're on the same page (ha ha): the naked-pointers mode and its companion page table are going to disappear, so new improvements to memory management like the one in this PR do not need to apply to naked-pointers mode. Actually, the less it applies to naked-pointers mode the better, as a performance incentive to go with the no-naked-pointers mode. (Likewise, 32-bit target architectures are less and less relevant, and OCaml might stop supporting them at some point in the future, so new improvements to memory management should squarely focus on 64-bit architectures.)

On the other hand, analyzing the performance improvements of this PR and the hardware limitations it can run into is fine, of course.

Re: huge pages, that's the matter for another PR. I agree the current support for HP in the OCaml runtime system is not very usable. If
Preliminary review with a few comments.
runtime/caml/misc.h
Outdated
#if defined(__GNUC__)
#define CAMLnoinline __attribute__ ((noinline))
#elif defined(_MSC_VER)
#define CAMLnoinline __declspec(noinline)
Question for @dra27: is this going to work on all supported versions of MSVC?
It even works on some unsupported ones! (Visual Studio .NET 2002 was when it was introduced; a fact you will I hope be pleased to know I did not have from memory...)
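(For reference, the conventional portable shape of such a macro is the following; the #else fallback is an assumption added here for illustration, not necessarily what the PR's misc.h ends up with.)

/* Portable "do not inline this function" annotation (sketch). */
#if defined(__GNUC__)
#define CAMLnoinline __attribute__ ((noinline))
#elif defined(_MSC_VER)
#define CAMLnoinline __declspec(noinline)
#else
#define CAMLnoinline /* no known equivalent: expand to nothing */
#endif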
runtime/major_gc.c
Outdated
struct mark_stack stk = *Caml_state->mark_stack;

uintnat young_start = (uintnat)Caml_state->young_start;
uintnat half_young_len = ((uintnat)Caml_state->young_end - (uintnat)Caml_state->young_start) >> 1;
Maybe using rotate1 here too would make it clearer?
runtime/major_gc.c
Outdated
/* Already black, nothing to do */
continue;
}
// FIXME work accounting
Is anything missing for work accounting?
I don't think so, but I remember not fully understanding the existing work accounting logic, so this is a note to myself to go back over and compare the two to make sure they count words the same way.
runtime/major_gc.c
Outdated
}
} else if (work <= 0 || stk.count == 0) {
if (min_pb > 0) {
// drain pb before quitting
If I understand what's going on, you're draining pb by pushing everything back to the stack. Wouldn't it be faster to do it explicitly here with a small loop rather than go through the generic code below?
It's slightly more complicated than that. If we reach this code because work <= 0, then you're right.

But if we reach this code because stk.count == 0, we have not yet reached the end. It's possible that there's some bottleneck in the heap layout, and the entire rest of the heap is accessible only through a half-dozen pointers, all of which happen to be in the prefetch buffer leaving the mark stack empty.

So, we should switch to a mode where we continue marking, but try to drain the prefetch buffer. The most likely outcome is that we do a couple more iterations before hitting this code again with min_pb == 0 and quitting. However, it's also possible that we discover a huge amount more marking work by following the pointers in the prefetch buffer.
Maybe the flow would be easier to follow if you separated the two cases:

if (work <= 0) {
  // push everything back to the stack and quit
} else if (pb_enqueued > pb_dequeued + min_pb || (stk.count == 0 && pb_enqueued > pb_dequeued)) {
  // dequeue from the prefetch buffer
} else if (stk.count > 0) {
  // take something from the stack
} else {
  // we're done
}
Just a suggestion. If you don't like it, I won't insist.
I agree with @damiendoligez, it would be much clearer that way.
I'm looking at this, but I don't currently see any easy way to make this change without duplicating lots of scanning logic for the "push everything back to stack and quit" path. This path isn't trivial: entries in the prefetch buffer are possibly-unmarked headers, while entries in the mark stack are intervals needing scanning. Pushing an entry from the prefetch buffer to the stack requires darkening, and dealing with No_scan_tag, Infix_tag and closure offsets.
I think there are some misconceptions here: with more memory parallelism this code should go faster, with no tweaking of parameters required. I'll try to explain why.

The processor's memory subsystem accepts requests to read memory, and after some delay it produces their contents. This delay ranges from 3-4 cycles (L1 hit) to several hundred cycles (main memory access). (In extreme cases, it can be even longer, if there is a TLB miss that causes a page walk to main memory as well as a main memory access.) The memory subsystem can make useful progress on several requests simultaneously, known as "memory level parallelism" or MLP. This is what Lemire is measuring in the blogpost you linked. Most contemporary x86 processors have maximum MLP around 10, some newer ones have more. (MLP is a bit more complicated than just one number: max MLP varies depending on which caches are involved, and some resources are shared between cores while others are not. But one number is enough for here.)

As an example, if you issue 20 load instructions on cycles 1 through 20, on a machine with max MLP of 10 and a 300-cycle latency to main memory, then you would expect all of the loads to be completed by about cycle 600: the memory subsystem will accept the first 10 very quickly and begin the loads, but instructions 11-20 will have to queue.

However, most programs do not issue a load instruction on each cycle, and so much of the available MLP often goes unused. Processors have a number of tricks to try to use more MLP: for instance, they will speculate through branches and begin performing loads before knowing for certain which way the branch goes, and they will detect sequential accesses and read ahead of the program. Unfortunately, none of these tricks are particularly effective with GC marking, which does a lot of unpredictable pointer chasing. Pointer chasing is particularly bad for MLP: the address of the next word to be examined is known only when the previous load returns, forcing things to be sequential.

The prefetching GC in this PR works differently. It immediately issues up to several hundred load instructions, through software prefetch instructions. (The exact number varies. The buffer size is 256, but it's not always full, and even when full many values are not pointers.) The number of issued requests vastly exceeds the maximum MLP available on the processor, so most of these requests queue. This tries to ensure that the memory subsystem is always working at full capacity: whenever a load instruction returns data, there is always another one queued ready to go.

This should transparently make use of larger MLPs available on other processors. (Anyone have an M1 Mac lying around to test with? I haven't tested on a processor with more than 10.) In some years' time when processors with MLP in the hundreds exist, we should perhaps consider enlarging the prefetch buffer, but 256 should be enough for now.
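As an aside, to make the general technique concrete: when the addresses to visit are known a fixed distance in advance, a software prefetch issued ahead of use lets many misses overlap instead of being paid for one at a time. The sketch below is only an illustration (not code from this PR); it assumes GCC/Clang's __builtin_prefetch, and the prefetch distance of 16 is an arbitrary illustrative choice.

#include <stddef.h>

/* Illustration only: sum the first field of each block pointed to by
   ptrs[0..n-1], issuing a prefetch a fixed distance ahead so that the
   resulting cache misses overlap rather than being serialised. */
long sum_first_fields(long **ptrs, size_t n)
{
  const size_t dist = 16;                        /* illustrative prefetch distance */
  long total = 0;
  for (size_t i = 0; i < n; i++) {
    if (i + dist < n)
      __builtin_prefetch(ptrs[i + dist], 0, 3);  /* request the line early */
    total += ptrs[i][0];                         /* by now, usually close to an L1 hit */
  }
  return total;
}

Marking cannot look a fixed distance ahead like this, because the next addresses are only discovered as objects are scanned; the prefetch buffer described above is what recreates this look-ahead distance.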
@stedolan Thanks for the detailed explanation of how your patch uses MLP. This clarifies to me your comment about exhausting memory parallelism. What your figures suggest is indeed that in practice, we should expect diminished returns from working on huge pages support after this PR (whereas in theory there can be an effect). If I manage to have an AMD Zen2/3 in the near future, I will try to reproduce your test of the TLB bottleneck with your patch (people with M1s need not bother since it works with 16KB pages and does not support larger pages).
Thanks for the explanations on MLP. One naive question:
What does "queue" mean? Is it possible for the program to stall because it is issuing too many prefetch requests and exceeding the maximum available MLP? If a stall is possible, isn't that strictly worse than prefetching less or not prefetching at all? I'm confused.
@gadmm I'm still interested in M1 benchmarks - not because of TLB issues, but because I believe the M1 has higher MLP than the machines I've been testing on.

@xavierleroy Yes, this GC sometimes stalls from exceeding the maximum available MLP. This is a good thing.

Fundamentally, marking is a process that does many memory accesses, most of which miss cache. Only a trivial amount of computation is done to each word of memory touched, so the process is limited by memory performance, not CPU. One way or another, the CPU is going to spend some time stalled, waiting for memory.

When trunk stalls, it generally stalls while scanning field

When the prefetching GC stalls, it is generally not during loads, which tend to hit in L1 cache (because they were prefetched). However, it is continually issuing prefetches for memory that will be needed later, and sometimes we run into the max MLP and stall. During this kind of stall, there are as many active loads as possible, so this stall only occurs when the memory subsystem is fully utilised.

In other words, stalling due to hitting maximum MLP is the ideal state for a marking loop: it means the memory accesses have been scheduled well enough to fully saturate the memory subsystem, at low enough bookkeeping overhead that the CPU has nothing to do but wait for the result.

Here are some performance counter results for trunk vs. prefetching. These results should be taken with a pinch of salt: first, they're from the markbench.ml program posted above, which is something of an ideal case for prefetching, and second, they don't separate the heap-initialisation phase from the actual GC work, so are a bit noisy.

Trunk:
Prefetching:
Stalls due to max MLP come under
Of course @stedolan, this is not what I meant. Thanks again for your explanations.
I understand you want to maximize MLP. But I still cannot convince myself this will minimize GC time, which is what we actually care about.

An hypothetical scenario: so many locations are prefetched so much in advance (at the cost of multiple stalls) that by the time an actual load is done, the corresponding prefetched data was already flushed out of cache and has to be fetched again.

Another hypothetical scenario: by issuing so many memory requests, the OCaml process is going to starve other processes executing on the same core (e.g. via hyperthreading), or even on other cores.

A third hypothetical scenario: by issuing so many memory requests, the OCaml process is consuming lots of energy and draining your batteries faster.

In other words: an ideal prefetching policy is one that makes sure all loads hit L1 cache, not one that maximizes MLP.
OK, I'll try going through your scenarios.
This is a real concern, and is one of two factors controlling the choice of Pb_size. On most x86 machines of the last decade, L1 is 32KB, consisting of 512 * 64-byte lines, and 10-20 active operations are sufficient to get good MLP. So Pb_size should be somewhere between maybe 20 and 500. In this patch it is currently 256. If you set it to, say, 2^16, then you will definitely see the issue you describe.
If this mattered, it would already occur in code that naturally achieves high MLP, such as naively summing an array of boxed integers.
This strategy does not issue any more memory requests than the current GC (assuming, as above, that you haven't set Pb_size too large).

The power management strategy on most modern processors is "race to idle": since the idle states consume so much less power than the active state, the best strategy is to get the work done as quickly as possible so that you can get back to the idle state. In this model, the most power-efficient approach is whatever's fastest.

I prefer your earlier phrasing: the goal is to minimize GC time. The marking loop is limited by memory performance, since its bookkeeping overhead is low. It has a fixed number of memory accesses to do, assuming again that Pb_size is not too large.

Since we do not control the number of memory accesses needing to be done nor their latency, minimising GC time is the same as maximising MLP.
runtime/major_gc.c
Outdated
struct mark_stack stk = *Caml_state->mark_stack;

uintnat young_start = (uintnat)Caml_state->young_start;
uintnat half_young_len = ((uintnat)Caml_state->young_end - (uintnat)Caml_state->young_start) >> 1;
Check-typo says this line is too long.
uintnat half_young_len = ((uintnat)Caml_state->young_end - (uintnat)Caml_state->young_start) >> 1;
uintnat half_young_len =
  ((uintnat)Caml_state->young_end - (uintnat)Caml_state->young_start) >> 1;
Out of curiosity, I just tested this on msvc64 on a Threadripper 3990X... @stedolan's markbench program built with ocamlopt (averaged over 5 runs, FWIW):
The two slow-downs are worrying.
I can reproduce the difference between the first two even on linux amd64, looking into it now. I'm surprised by the large slowdown to this PR with broken prefetching - there's definitely some overhead for the prefetching bookkeeping if no prefetching is being done, but I'm surprised it's that large. I'll have a look.
The difference between 9f804a2 and 5b9789c is explained by #9756 and #9951: #9756 made it go ~45% faster, and then #9951 made it go ~65% slower, leading to a ~20% slowdown overall. (Don't read too much into the 65% number: this is a microbenchmark. #9951 replaced an optimisation with a different version, but the old version happened to work particularly well on this benchmark.)

I'm not sure what's going on between 5b9789c and this PR. One issue is that 5b9789c isn't actually the base commit of this PR, so it's possible that the slowdown was introduced on trunk since 5b9789c. In my testing (limited to x86_64 Linux on Intel), this PR with
Overall, this PR looks good. I have a few comments, see below.
I am slightly worried about the Forward_tag optimization. I don't buy the argument that most lazy values get forced soon. It depends very much on the kind of program. If we decide to get rid of this optimization in the major heap, then we should have strong benchmarks showing that there is no significant performance loss.

One possibility to keep this optimization with the prefetching buffer is to store pointers to values instead of values in the prefetch buffer. This means that we do one more dereferencing per pointer in the heap, but we can hope that this is a cache hit, so this should really be cheap.
runtime/major_gc.c
Outdated
(Tag_val(v) >= No_scan_tag || !Is_black_val(v))) {
v = Val_hp(Op_val(v) + Wosize_val(v));
}
me.start = Op_val(v);
At that point, it may be the case that v == end + 1 and that end is the end of the chunk. In this case, aren't we dereferencing a pointer outside of the chunk when we are looking for the header of v?
If v is one past the end of the chunk, then the header of v is the last word of the chunk. Did I misunderstand your question?
Sorry, this was misleading. As I explained above, v could be two past the end of the chunk. Then, when we read the header, we read one past the end of the chunk.

Anyway, think about blocks: when we have finished scanning the chunk, then v points to a block which is entirely outside of the chunk, including its header. So, we are effectively reading a header which is outside of the chunk.
Probably the reason we are not getting a segfault from this is the padding that caml_stat_alloc_aligned adds.
runtime/major_gc.c
Outdated
}

p += Whsize_hp(Hp_op(p));
v = Val_hp(me.end);
If me.end is the end of the chunk, then it seems like this creates a pointer more than one-past the end of the chunk, which is UB. (Well, ok, it is unlikely that the compiler will optimize this in such a way that this triggers a bug.)
You're right! I'll change this to work on header_t* instead of value.
runtime/major_gc.c
Outdated
while (me.end <= end) {
value v;
if (stk->count < stk->size/4) {
stk->stack[stk->count++] = me;
There is a slight change in the logic here: the previous code called mark_stack_push while we are pushing manually here. There are two differences:
1- We do not perform here the "Some(42)" optimization
2- Work accounting is not taken into account
Could you comment on why you decided to drop these two features here?
The lack of (1) is an oversight.
For (2), I think the work accounting here was wrong? It looks like trunk takes credit twice for marking the same object if there is a mark stack overflow. (The GC pacing logic relies on there being a fixed amount of work, determined by the shape of the heap at the time marking started).
@damiendoligez, can you confirm the previous implementation was wrong wrt. work accounting?
Yes, the previous implementation is wrong. It takes credit for the first few words, then pushes the block to the stack. If the stack overflows and the block is evicted, it will take credit again for these few words.
mark_entry m = stk.stack[--stk.count];
scan = m.start;
obj_end = m.end;
Here, we will be reading the block while it does not come from the prefetch buffer (and hence it might not be in the cache). It seems like the heuristic here is that we expect blocks at the top of the stack to be in the cache. Is that the idea? Is this something you have experimented with? Could you please add a comment?
I think the idea is that the prefetch buffer is at its minimum size, so its entries are likely to be "not yet fetched". So we'll take a cache miss anyway. Then it makes sense to get something from the stack instead of prematurely emptying the prefetch buffer.
Yes, as @damiendoligez says. (And yes, I benchmarked this)
*Caml_state->mark_stack = stk;
realloc_mark_stack(Caml_state->mark_stack);
stk = *Caml_state->mark_stack;
*Caml_state->mark_stack = stk;
realloc_mark_stack(Caml_state->mark_stack);
stk = *Caml_state->mark_stack;
realloc_mark_stack(&stk);

Or perhaps you do not want to do that because you want to guarantee that the compiler keeps stk in registers? If so, then this is worth a comment.
Yes, most of the variables in this function are about using registers. I'll add some comments.
Alright. I suspect that a good compiler could figure out that &stk is not escaping and hence keep this in a register, but this is not robust anyway.

Could you use the same pattern in mark_stack_push, and get rid of the parameter of realloc_mark_stack and mark_stack_prune?
I'm not aware of any compiler that would make such a change without inlining everything.
We certainly could and should use the same pattern in mark_stack_push, but again I'd prefer to leave the ephemeron codepath as is in this patch.
> I'm not aware of any compiler that would make such a change without inlining everything.

Sure. But C/C++ compilers tend to inline a lot these days.

> We certainly could and should use the same pattern in mark_stack_push, but again I'd prefer to leave the ephemeron codepath as is in this patch.

As you prefer. But this conflict is not going to be so complicated to fix, and if we don't do this change here, we will probably forget it.
What about adding a TODO comment?
/* auxiliary function of mark_ephe_aux */
Caml_inline void mark_ephe_darken(struct mark_stack* stk, value v, mlsize_t i,
                                  int in_ephemeron, int *slice_pointers,
                                  intnat *work)
At the very least, the in_ephemeron parameter should be removed and the code simplified accordingly, since this function is now only called from the ephemeron logic.

In addition, I wonder whether it would be possible to avoid having two implementations for essentially the same thing. Do you think we could call do_some_marking from mark_ephe_aux?
I'm not going to change the ephemeron path in this PR, because there are a number of known bugs being worked on in that code, and I don't want to cause conflicts with #9424. Once both are merged, we can do some cleanup.
work--;

CAML_EVENTLOG_DO({
slice_fields++;
This is no longer incremented.
In the new version this is now incremented much more often; maybe this user-facing change (fix) deserves to be logged in the changelog.
Thanks for the review! I'll have a look at this today.
I was thinking along similar lines, but in order to decide whether this optimisation is worthwhile we really need some programs to test with. This is where I got stuck: I have yet to find anything that puts more than a couple of thousand Lazy values on the major heap. Do you have anything in mind?
I just re-ran some benchmarks, comparing the current version to the original. It turns out that one of the refactorings has introduced a slowdown, causing about ~10% more instructions to be executed (and a similar, but more variable, slowdown in time). I've restored the old loop logic (the difference is about which order various tests occur in), as it is measurably faster. It's a bit less readable, but at least it's commented this time.
This has now been "approved" for about 3 weeks, so unless someone strenuously objects I'm going to hit "merge" later today.
I am finishing up my analysis of the inner loop, which I intend to write later at the previously-mentioned link, with a very small improvement/simplification to propose regarding

Essentially, instead of a single test using the rotation trick (even if it is cool), I am testing it in two separate conditions

I understand the result as follows: in real programs,

The improvement is small but in any case it simplifies the code. The fact that a separate

Thank you to Stephen for his help (and patience) in private conversation during the review!
My understanding is that @gadmm's comment is about a potential further improvement. That can be done in a later PR, so I'm going to click merge. Feel free to keep discussing the improvement here though.
My understanding is that this has been addressed
Sure @lpw25 (but it creates more work to open a PR just for this). Let me know @stedolan if you are interested in the improvement/simplification. For the curious, here are some benchmarks. (I offer them without interpretation. @stedolan asked for benchmarks on interestingly-large programs, but for an interpretation it is better to do it with all the details elsewhere.)
Things measured:
The test is to install some ocaml or coq packages with opam, which then runs OCaml (in the first case) or Coq (in the second case). This is with 2 physical cores and 2 jobs; the gains with all logical cores busy are likely to be smaller (since hyperthreading helps keep the physical cores busy during stalls), which would be another interesting benchmark to run.

OCaml, compiling OCaml, dune, lablgtk3 and a couple of smaller packages
Latency is not an interesting measure for OCaml itself, and the following table aggregates the measures of different programs so the absolute value is not meaningful in itself (for instance
Coq, compiling some heavy Mathcomp libraries (mathcomp-character and mathcomp-odd-order)
Coq's GC settings are optimised for throughput (with a minor heap of 256MB and a space overhead of 200%), so a lesser share of the time is spent in the major GC compared to OCaml. Latency is not an interesting measure for Coq, but the gain is as above. (edit: clarify that the total time also measures the improvement of the patch that introduces prefetching during sweeping)
This PR rewrites the core marking loop of the major GC, using prefetching to make better use of the processor's memory parallelism. This removes essentially all of the cache misses that occur during marking, speeding up GC.
On a microbenchmark with about 800MB of heap, the time for a Gc.full_major improves from about 1.8 seconds to 0.5 seconds. Of course, most programs don't spend 100% of their time in GC, so improvements are not generally this dramatic. On the few programs it's been tested on, marking time is reduced by 1/3 - 2/3, leading to overall performance improvements of anywhere around 5-20%, depending on how much GC the program does. (More benchmarking is needed!)
I only expect to see any improvements on programs where the heap is much bigger than the cache. Programs that use less than a few tens of megabytes are unlikely to be improved much by this patch, although they shouldn't get slower.
Algorithm
The current marking algorithm is a pure depth-first-search: marking always proceeds with the object on top of the mark stack, and new blocks are pushed to the stack as they are found. Since the objects are unlikely to be in cache, the GC spends much of its time stalling on cache misses as it waits for the header of a new object to be loaded.
The algorithm in this PR uses a small (currently 256-entry) circular buffer as a queue in front of the mark stack, known as the prefetch buffer. During marking, the next object to be scanned is drawn from the prefetch buffer (if it has at least Pb_min elements), or popped from the mark stack if the prefetch buffer is empty or close to empty. When new pointers are discovered during marking, these are prefetched and enqueued in the prefetch buffer rather than being followed immediately. Blocks are only pushed to the mark stack when the prefetch buffer overflows.

This ensures that new objects scanned have generally been prefetched at least Pb_min steps ago, which means they are very likely to already be in cache.
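As a rough illustration of this queue discipline (a simplified sketch using a toy object model, not the runtime's value representation: the real loop also handles work accounting, darkening, No_scan_tag, Infix_tag, closures and mark-stack overflow, and the Pb_min value here is only an assumption):

#include <stddef.h>

#define Pb_size 256u
#define Pb_mask (Pb_size - 1)
#define Pb_min  64u                           /* assumed value, for illustration */

typedef struct node { int marked; size_t nchildren; struct node **children; } node;

#define STACK_CAP (1u << 20)
static node *stack[STACK_CAP];
static size_t sp;

static node *pb[Pb_size];                     /* circular prefetch buffer */
static unsigned pb_enq, pb_deq;

static void enqueue(node *n)
{
  __builtin_prefetch(n, 0, 3);                /* start fetching the object early */
  if (pb_enq - pb_deq < Pb_size)
    pb[(pb_enq++) & Pb_mask] = n;             /* normal case: into the buffer */
  else if (sp < STACK_CAP)
    stack[sp++] = n;                          /* buffer full: overflow to the stack */
  /* (a real GC must never drop entries; its mark stack grows on demand) */
}

void mark(node *root)
{
  enqueue(root);
  for (;;) {
    node *n;
    if (pb_enq - pb_deq >= Pb_min)
      n = pb[(pb_deq++) & Pb_mask];           /* prefetched at least Pb_min steps ago */
    else if (sp > 0)
      n = stack[--sp];                        /* buffer low: refill from the stack */
    else if (pb_enq != pb_deq)
      n = pb[(pb_deq++) & Pb_mask];           /* end game: drain the buffer */
    else
      return;                                 /* everything reachable is marked */
    if (n->marked) continue;
    n->marked = 1;
    for (size_t i = 0; i < n->nchildren; i++)
      if (n->children[i] != NULL && !n->children[i]->marked)
        enqueue(n->children[i]);
  }
}

Pb_min leaves enough slack that, by the time an entry is dequeued, the prefetch issued when it was enqueued has usually had time to complete.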
Mark stack changes

This PR also contains a small change to the mark stack structure. It can already contain intervals, to indicate that an object is partially scanned, but those intervals are represented by a (value, offset) pair. This patch changes that to a (start, end) pair, where start = value + offset and end = value + Wosize_val(value). This saves a cache miss in determining Wosize_val(value) when traversing the tail end of a long array.
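In other words, the entry type changes roughly as follows (a sketch based on the description above; the exact field names and types in the patch may differ):

#include <caml/mlvalues.h>   /* value, uintnat */

/* Before (illustrative): the object plus how far scanning has progressed;
   resuming needs the object's header again to find where the fields end. */
typedef struct { value block; uintnat offset; } mark_entry_before;

/* After (illustrative): the half-open range of fields still to scan, so a
   partially scanned object can be resumed without reloading its header. */
typedef struct { value *start; value *end; } mark_entry_after;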
Configuration parameters
The testing and benchmarking of this patch has been done on Intel x86_64 processors with a non-default configuration:
The patch should work with naked pointers enabled, although no optimisation work has been done in this configuration and I expect performance to be much worse.
The -mbranches-within-32B option is a workaround for a bug in most recent Intel processors, the Intel JCC Erratum. Intel's microcode workaround for this bug introduces a performance issue in instruction decoding when branch instructions straddle 32-byte boundaries. Most programs are not seriously affected by this as instruction decoding is rarely the bottleneck. However, as the prefetching GC loop generally hits cache, instructions execute quickly (approx 2.9 of them per cycle) and it is possible to become performance-bound by decoding. On the markbench.ml microbenchmark above, removing the -mbranches-within-32B option had a cost of up to 20%. (The numbers for trunk use the same configuration, although it is not as strongly affected.)

(We should perhaps consider enabling -mbranches-within-32B by default. It has a 1-2% code size penalty and a performance improvement that's usually irrelevant and occasionally dramatic. It's annoying to ask AMD users to pay the code size penalty for Intel's mistake, though.)

Remaining work
[The design and initial implementation of this patch was in collaboration with Will Hasenplaugh, remaining bugs are my own]