
Speed up GC by prefetching during marking #10195

Merged · 11 commits · Aug 10, 2021

Conversation

@stedolan (Contributor) commented Feb 3, 2021

This PR rewrites the core marking loop of the major GC, using prefetching to make better use of the processor's memory parallelism. This removes essentially all of the cache misses that occur during marking, speeding up GC.

On a microbenchmark with about 800MB of heap, the time for a Gc.full_major improves from about 1.8 seconds to 0.5 seconds.

Of course, most programs don't spend 100% of their time in GC, so improvements are not generally this dramatic. On the few programs it has been tested on, marking time is reduced by one-third to two-thirds, leading to overall performance improvements of roughly 5-20%, depending on how much GC the program does. (More benchmarking is needed!)

I only expect to see any improvements on programs where the heap is much bigger than the cache. Programs that use less than a few tens of megabytes are unlikely to be improved much by this patch, although they shouldn't get slower.

Algorithm

The current marking algorithm is a pure depth-first-search: marking always proceeds with the object on top of the mark stack, and new blocks are pushed to the stack as they are found. Since the objects are unlikely to be in cache, the GC spends much of its time stalling on cache misses as it waits for the header of a new object to be loaded.

The algorithm in this PR uses a small (currently 256-entry) circular buffer as a queue in front of the mark stack, known as the prefetch buffer. During marking, the next object to be scanned is drawn from the prefetch buffer (if it has at least Pb_min elements), or popped from the mark stack if the prefetch buffer is empty or close to empty. When new pointers are discovered during marking, these are prefetched and enqueued in the prefetch buffer rather than being followed immediately. Blocks are only pushed to the mark stack when the prefetch buffer overflows.

This ensures that new objects scanned have generally been prefetched at least Pb_min steps ago, which means they are very likely to already be in cache.
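A minimal sketch of this control flow, for orientation only (Pb_min, Pb_mask and the circular-buffer indexing follow the snippets quoted later in this thread; the buffer size shown, the helper functions and the scanning details are illustrative placeholders, not the PR's actual code):

```c
#include "caml/mlvalues.h"  /* value, header_t, Hp_val, ... */
#include "caml/misc.h"      /* caml_prefetch */

#define Pb_size 256                        /* power of two */
#define Pb_mask (Pb_size - 1)
#define Pb_min  64                         /* illustrative value */

static value pb[Pb_size];                  /* circular prefetch buffer */
static uintnat pb_enqueued, pb_dequeued;

/* Hypothetical helpers standing in for the real mark-stack operations. */
int mark_stack_is_empty(void);
value mark_stack_pop_block(void);
void mark_stack_push_block(value v);
void scan_fields_of(value v);              /* calls found_pointer on each field */

static void found_pointer(value child)
{
  if (pb_enqueued - pb_dequeued == Pb_size) {
    mark_stack_push_block(child);          /* buffer full: overflow to the stack */
  } else {
    caml_prefetch(Hp_val(child));          /* start the memory access now... */
    pb[(pb_enqueued++) & Pb_mask] = child; /* ...and follow it later */
  }
}

static void mark_loop_sketch(void)
{
  for (;;) {
    value v;
    if (pb_enqueued - pb_dequeued >= Pb_min) {
      /* enqueued (and prefetched) at least Pb_min steps ago: likely in cache */
      v = pb[(pb_dequeued++) & Pb_mask];
    } else if (!mark_stack_is_empty()) {
      v = mark_stack_pop_block();
    } else if (pb_enqueued != pb_dequeued) {
      v = pb[(pb_dequeued++) & Pb_mask];   /* drain the buffer before stopping */
    } else {
      break;                               /* nothing left anywhere: done */
    }
    scan_fields_of(v);                     /* darken v, enqueue its children */
  }
}
```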

Mark stack changes

This PR also contains a small change to the mark stack structure. It can already contain intervals, to indicate that an object is partially scanned, but those intervals are represented by a (value, offset) pair. This patch changes that to a (start, end) pair, where start = value + offset and end = value + Wosize_val(value). This saves a cache miss in determining Wosize_val(value) when traversing the tail end of a long array.
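For concreteness, the new entry shape is roughly the following (a sketch based on the description above and on the mark_entry fields visible in the review comments below; the real definition may differ):

```c
/* Old shape: (value, offset). Resuming a partially scanned block needed
   Wosize_val(value), i.e. another read of the (possibly uncached) header. */

/* New shape: a half-open interval of fields still to be scanned. */
typedef struct {
  value *start;  /* Op_val(value) + offset: next field to scan */
  value *end;    /* Op_val(value) + Wosize_val(value): one past the last field */
} mark_entry;
```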

Configuration parameters

The testing and benchmarking of this patch has been done on Intel x86_64 processors with a non-default configuration:

./configure --disable-naked-pointers CC='gcc -Wa,-mbranches-within-32B' AS='as -mbranches-within-32B'

The patch should work with naked pointers enabled, although no optimisation work has been done in this configuration and I expect performance to be much worse.

The -mbranches-within-32B option is a workaround for a bug in most recent Intel processors, the Intel JCC Erratum. Intel's microcode workaround for this bug introduces a performance issue in instruction decoding when branch instructions straddle 32-byte boundaries. Most programs are not seriously affected by this as instruction decoding is rarely the bottleneck. However, as the prefetching GC loop generally hits cache, instructions execute quickly (approx 2.9 of them per cycle) and it is possible to become performance-bound by decoding. On the markbench.ml microbenchmark above, removing the -mbranches-within-32B option had a cost of up to 20%. (The numbers for trunk use the same configuration, although it is not as strongly affected).

(We should perhaps consider enabling -mbranches-within-32B by default. It has a 1-2% code size penalty and a performance improvement that's usually irrelevant and occasionally dramatic. It's annoying to ask AMD users to pay the code size penalty for Intel's mistake, though).

Remaining work

  • Code cleanup: There's some commented-out test/debug code in this patch, and not much commenting otherwise. It needs some cleanup before merging, but this version should be good for initial discussion/testing.
  • Benchmarking: This has not yet been benchmarked on interestingly-large programs.
  • Forward_tag optimisation: This patch does not currently implement the Forward_tag lazy short-circuiting optimisation, which is somewhat tricky to fit with the prefetching style. Very preliminary testing has suggested that this optimisation might not be very important, as most lazy values seem to be forced early (and so caught by the minor GC's short-circuiting) or not forced at all. More benchmarking is needed, either to show that this optimisation doesn't matter much or to show that it can be implemented efficiently.

[The design and initial implementation of this patch were done in collaboration with Will Hasenplaugh; remaining bugs are my own.]

@xavierleroy (Contributor):

This sounds extremely promising! Thank you!

I'm not the most competent reviewer for this kind of code. Nonetheless I had a look at the diff, and I was put off by lots of commented-out code, and unguarded uses of __builtin_prefetch (we now have caml_prefetch defined in caml/misc.h). Also, all CI checks are failing. So, could you please make this PR ready for review and tell us?

@stedolan (Contributor, Author) commented Feb 8, 2021

> I'm not the most competent reviewer for this kind of code. Nonetheless I had a look at the diff, and I was put off by lots of commented-out code, and unguarded uses of __builtin_prefetch (we now have caml_prefetch defined in caml/misc.h). Also, all CI checks are failing. So, could you please make this PR ready for review and tell us?

Hmm, there was more of that than I remembered! I've deleted the commented-out code and removed the cause of the CI failures (*), so this should be ready to review now. I ran into #10203 when trying to benchmark, though, so you'll have to wait a bit longer for sandmark numbers.

(*) the issue was an assertion that all zero-length blocks are atoms, which isn't actually true because of how ocamlopt statically allocates empty arrays.

@gadmm (Contributor) left a comment:

When you say that with naked pointers you expect performance to be much worse, I understand that you might mean that you just have not spent time on optimising it yet, but for what it is worth here is a comment about that case.

}
caml_prefetch(Hp_val(v));
caml_prefetch(&Field(v, Queue_prefetch_distance - 1));
pb[(pb_enqueued++) & Pb_mask] = v;
@gadmm (Contributor) commented Feb 13, 2021:

For naked pointers, what about:

  • Prefetching and queuing the naked pointer here (replacing Is_major_block with Is_block_and_not_young (which is cool bit hacking by the way))
  • Additionally prefetching the page table entry
  • Checking Is_in_heap when popping the block off the queue at the start of the loop

Reply (Member):

The page table will be removed soon. Please don't make this code more complex because of it.

uintnat min_pb = Pb_min;
struct mark_stack stk = *Caml_state->mark_stack;

uintnat young_start = (uintnat)Caml_state->young_start;
@gadmm (Contributor) commented Feb 13, 2021:

Should this be + 1 ? The definition of Is_young gives Is_young(caml_young_start) == 0, and although unlikely, caml_young_start - 1 is allowed to contain the header of a 0-sized block with a black gc tag.

Reply (@stedolan, Author):

I think you're right! Nice catch.

Reply (Member):

Should be + sizeof(header_t) or if you want + Whsize_wosize (0).
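In code, the suggested adjustment is a one-liner (a sketch of the suggestion only, not the committed fix):

```c
/* Exclude the word just below the minor heap: Is_young(caml_young_start)
   is 0, and that word may legitimately hold the header of a zero-sized
   black block. */
uintnat young_start = (uintnat)Caml_state->young_start + sizeof(header_t);
```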

@stedolan (Contributor, Author):

> When you say that with naked pointers you expect performance to be much worse, I understand that you might mean that you just have not spent time on optimising it yet, but for what it is worth here is a comment about that case.

I mean both that I have not spent time on optimising it, and that I think performance won't be great even with time spent optimising it. The code here often exhausts the available memory parallelism (on the Skylake core I was testing on), so doing more memory accesses is going to take more time, regardless of how well they're prefetched.

Having said that, I think the prefetching strategy you describe is the right one (at least for 64-bit targets; 32-bit ones have a different page table).

@gadmm (Contributor) commented Feb 18, 2021

I meant for both page tables: for the 32-bit page table, let us hope that the first level is in cache already (the size of the first level is 512 words and it might not be randomly distributed), and only prefetch the second level. I am guessing that this would be good enough. Do you have another strategy in mind?

About exhausting memory parallelism (please correct me if I say anything false, I do not consider myself an expert), I am wondering whether you are already doing something against TLB misses. They can be a bottleneck for memory parallelism, and OCaml programs with default settings are susceptible to them since the runtime will not use huge pages. I tested this on your program without the prefetching optimisation: with default settings (in particular transparent hugepages set to enabled=madvise) I get 1.75% of dTLB load misses, against 0.64% with THP enabled=always, and 0.01% with OCAMLRUNPARAM=H. This has a noticeable effect on the GC duration (2.707 s/gc vs. 2.461 s/gc vs. 2.285 s/gc, respectively). Furthermore, it seems that more recent processors tend to offer more memory parallelism than Skylake, but only if your bottleneck is not the TLB.

For reference, THP enabled=always gives a speed-up of about 3% on memory-intensive Coq benchmarks.

(OCAMLRUNPARAM=H requires that you pre-allocate 500 huge pages of size 2MB with hugeadm, which is likely to fail if your computer was not started recently. A simpler solution is to use jemalloc with the option thp:always, which benefits the whole allocated memory and not just the major heap, outperforms the H option in terms of TLB misses (0.00%) and still gives good results when the memory is fragmented (0.20% TLB misses vs. having to reboot). For reference see the command lines I used and the results.)

@stedolan (Contributor, Author):

Prefetching helps with TLB performance as well. Prefetching allows the processor to handle more memory requests in parallel, which include page walks if necessary.

As you say, hugepages will improve performance due to better TLB utilisation, but I think that's independent of the changes here. Below are some numbers from my laptop, where it seems that hugepages have a smaller effect on the prefetching code than on trunk. (This could be explained by the prefetching code managing to overlap more TLB misses than trunk does.) Times are noisy, so only 2 significant figures are given. Some of the trunk times have even more variance than that.

trunk:

| | dTLB misses | time |
| --- | --- | --- |
| THP enabled=madvise | 1.1% | 3.8 s/gc |
| THP enabled=always | 0.4% | 3.4 s/gc |
| jemalloc thp:always | 0.2% | 3.3 s/gc |

prefetching:

| | dTLB misses | time |
| --- | --- | --- |
| THP enabled=madvise | 0.5% | 0.60 s/gc |
| THP enabled=always | 0.2% | 0.56 s/gc |
| jemalloc thp:always | 0.02% | 0.56 s/gc |

> Furthermore, it seems that more recent processors tend to offer more memory parallelism than Skylake, but only if your bottleneck is not the TLB.

Interesting! Do you have a reference? (To be specific, the detail I'm most interested in is the number of L1 line fill buffers, as these tend to be the limiting factor on random accesses from a single core. More memory parallelism elsewhere in the system doesn't help GC much, as it can be difficult to saturate using non-sequential accesses from a single core)

@gadmm (Contributor) commented Feb 22, 2021

What your figures show corresponds indeed to what I understood with "exhausts the available memory parallelism". I understood it to mean: there is just enough memory parallelism without the page table, but by chance there is nothing spared for prefetching the page table. According to this, even with the strategy I proposed, we should expect it to be noticeably slower with the current page table than without. But given that you did not seem to do anything special for the TLB misses, I expect the combination prefetching+page table+THP to be anywhere between no gain compared to without THP (the TLB was not a bottleneck) to having same performance as without the page table (the TLB was the bottleneck, and prefetching removes the cost of the page table). To test the claim that performance will not be great with the page table, we need to make sure to run numbers with huge pages (which is just a question of using jemalloc with LD_PRELOAD as we have seen). Unless I misunderstood and you meant something completely different.

Another implicit question was whether with more memory parallelism you could tweak the parameters to make it go faster, in no-naked-pointers mode, but your answer implicitly seems to say no.

> Interesting! Do you have a reference?

As a non-specialist, I enjoyed Lemire's blog, see for instance 1,2,3 for the claims in question. You can easily find other experiments like this showing increased MLP at L1 level for M1 and Zen2.

Regarding your suggestion of adding -mbranches-within-32B, since this is relevant for performance in this patch, how about going ahead and proposing the change (here or in a separate PR?).

@xavierleroy (Contributor):

> I understood it to mean: there is just enough memory parallelism without the page table, but by chance there is nothing spared for prefetching the page table.

Just to make sure we're on the same page (ha ha): the naked-pointers mode and its companion page table are going to disappear, so new improvements to memory management like the one in this PR do not need to apply to naked-pointers mode. Actually, the less it applies to naked-pointers mode the better, as a performance incentive to go with the no-naked-pointers mode.

(Likewise, 32-bit target architectures are less and less relevant, and OCaml might stop supporting them at some point in the future, so new improvements to memory management should squarely focus on 64-bit architectures.)

On the other hand, analyzing the performance improvements of this PR and the hardware limitations it can run into is fine, of course.

Re: huge pages, that's the matter for another PR. I agree the current support for HP in the OCaml runtime system is not very usable. If jemalloc can help, that's good.

@damiendoligez (Member) left a comment:

Preliminary review with a few comments.

#if defined(__GNUC__)
#define CAMLnoinline __attribute__ ((noinline))
#elif defined(_MSC_VER)
#define CAMLnoinline __declspec(noinline)
Comment (Member):

Question for @dra27: is this going to work on all supported versions of MSVC?

Reply (Member):

It even works on some unsupported ones! (Visual Studio .NET 2002 was when it was introduced; a fact you will I hope be pleased to know I did not have from memory...)


struct mark_stack stk = *Caml_state->mark_stack;

uintnat young_start = (uintnat)Caml_state->young_start;
uintnat half_young_len = ((uintnat)Caml_state->young_end - (uintnat)Caml_state->young_start) >> 1;
Comment (Member):

Maybe using rotate1 here too would make it clearer?

/* Already black, nothing to do */
continue;
}
// FIXME work accounting
Comment (Member):

Is anything missing for work accounting?

Reply (@stedolan, Author):

I don't think so, but I remember not fully understanding the existing work accounting logic, so this is a note to myself to go back over and compare the two to make sure they count words the same way.

}
} else if (work <= 0 || stk.count == 0) {
if (min_pb > 0) {
// drain pb before quitting
Comment (Member):

If I understand what's going on, you're draining pb by pushing everything back to the stack. Wouldn't it be faster to do it explicitly here with a small loop rather than go through the generic code below?

Reply (@stedolan, Author):

It's slightly more complicated than that. If we reach this code because work <= 0, then you're right.

But if we reach this code because stk.count == 0, we have not yet reached the end. It's possible that there's some bottleneck in the heap layout, and the entire rest of the heap is accessible only through a half-dozen pointers, all of which happen to be in the prefetch buffer leaving the mark stack empty.

So, we should switch to a mode where we continue marking, but try to drain the prefetch buffer. The most likely outcome is that we do a couple more iterations before hitting this code again with min_pb == 0 and quitting. However, it's also possible that we discover a huge amount more marking work by following the pointers in the prefetch buffer.

Reply (Member):

Maybe the flow would be easier to follow if you separated the two cases:

if (work <= 0) {
  // push everything back to the stack and quit
} else if (pb_enqueued > pb_dequeued + min_pb
           || (stk.count == 0 && pb_enqueued > pb_dequeued)) {
  // dequeue from the prefetch buffer
} else if (stk.count > 0) {
  // take something from the stack
} else {
  // we're done
}

Just a suggestion. If you don't like it, I won't insist.

Reply (Contributor):

I agree with @damiendoligez, that would be much clearer that way.

Reply (@stedolan, Author):

I'm looking at this, but I don't currently see any easy way to make this change without duplicating lots of scanning logic for the "push everything back to stack and quit" path. This path isn't trivial: entries in the prefetch buffer are possibly-unmarked headers, while entries in the mark stack are intervals needing scanning. Pushing an entry from the prefetch buffer to the stack requires darkening, and dealing with No_scan_tag, Infix_tag and closure offsets.

@stedolan (Contributor, Author) commented Mar 1, 2021

@gadmm:

> What your figures show corresponds indeed to what I understood with "exhausts the available memory parallelism". I understood it to mean: there is just enough memory parallelism without the page table, but by chance there is nothing spared for prefetching the page table. According to this, even with the strategy I proposed, we should expect it to be noticeably slower with the current page table than without. But given that you did not seem to do anything special for the TLB misses, I expect the combination prefetching+page table+THP to be anywhere between no gain compared to without THP (the TLB was not a bottleneck) to having same performance as without the page table (the TLB was the bottleneck, and prefetching removes the cost of the page table). To test the claim that performance will not be great with the page table, we need to make sure to run numbers with huge pages (which is just a question of using jemalloc with LD_PRELOAD as we have seen). Unless I misunderstood and you meant something completely different.
>
> Another implicit question was whether with more memory parallelism you could tweak the parameters to make it go faster, in no-naked-pointers mode, but your answer implicitly seems to say no.

I think there are some misconceptions here: with more memory parallelism this code should go faster, with no tweaking of parameters required. I'll try to explain why.

The processor's memory subsystem accepts requests to read memory, and after some delay it produces their contents. This delay ranges from 3-4 cycles (L1 hit) to several hundred cycles (main memory access). (In extreme cases, it can be even longer, if there is a TLB miss that causes a page walk to main memory as well as a main memory access).

The memory subsystem can make useful progress on several requests simultaneously, known as "memory level parallelism" or MLP. This is what Lemire is measuring in the blogpost you linked. Most contemporary x86 processors have maximum MLP around 10, some newer ones have more. (MLP is a bit more complicated than just one number: max MLP varies depending on which caches are involved, and some resources are shared between cores while others are not. But one number is enough for here).

As an example, if you issue 20 load instructions on cycles 1 through 20, on a machine with max MLP of 10 and a 300-cycle latency to main memory, then you would expect all of the loads to be completed by about cycle 600: the memory subsystem will accept the first 10 very quickly and begin the loads, but instructions 11-20 will have to queue.

However, most programs do not issue a load instruction on each cycle, and so much of the available MLP often goes unused. Processors have a number of tricks to try to use more MLP: for instance, they will speculate through branches and begin performing loads before knowing for certain which way the branch goes, and they will detect sequential accesses and read ahead of the program. Unfortunately, none of these tricks are particularly effective with GC marking, which does a lot of unpredictable pointer chasing. Pointer chasing is particularly bad for MLP: the address of the next word to be examined is known only when the previous load returns, forcing things to be sequential.

The prefetching GC in this PR works differently. It immediately issues up to several hundred load instructions, through software prefetch instructions. (The exact number varies. The buffer size is 256, but it's not always full, and even when full many values are not pointers). The number of issued requests vastly exceeds the maximum MLP available on the processor, so most of these requests queue. This tries to ensure that the memory subsystem is always working at full capacity: whenever a load instruction returns data, there is always another one queued ready to go.

This should transparently make use of larger MLPs available on other processors. (Anyone have an M1 Mac lying around to test with? I haven't tested on a processor with more than 10). In some years' time when processors with MLP in the hundreds exist, we should perhaps consider enlarging the prefetch buffer, but 256 should be enough for now.
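A toy example of the general idea, outside the GC (assuming GCC/Clang for __builtin_prefetch): issuing prefetches well ahead of where the data is consumed lets many cache misses be in flight at once instead of being serviced one after another.

```c
#include <stddef.h>

/* Sum values reached through an array of pointers, prefetching DIST
   elements ahead so that several misses overlap. */
double sum_indirect(double **ptrs, size_t n)
{
  enum { DIST = 64 };                      /* illustrative prefetch distance */
  double s = 0.0;
  for (size_t i = 0; i < n; i++) {
    if (i + DIST < n)
      __builtin_prefetch(ptrs[i + DIST]);  /* read prefetch, default hints */
    s += *ptrs[i];                         /* usually a cache hit by now */
  }
  return s;
}
```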

@gadmm (Contributor) commented Mar 1, 2021

@stedolan Thanks for the detailed explanation of how your patch uses MLP. This clarifies to me your comment about exhausting memory parallelism. What your figures suggest is indeed that in practice, we should expect diminished returns of working on huge pages support after this PR (whereas in theory there can be an effect). If I manage to have an AMD Zen2/3 in the near future, I will try to reproduce your test of the TLB bottleneck with your patch (people with M1s need not bother since it works with 16KB pages and does not support larger pages).

@xavierleroy (Contributor):

Thanks for the explanations on MLP. One naive question:

> The prefetching GC in this PR works differently. It immediately issues up to several hundred load instructions, through software prefetch instructions. (The exact number varies. The buffer size is 256, but it's not always full, and even when full many values are not pointers). The number of issued requests vastly exceeds the maximum MLP available on the processor, so most of these requests queue.

What does "queue" means? Is is possible for the program to stall because it is issuing too many prefetch requests and exceeding the maximum available MLP? If a stall is possible, isn't that strictly worse than prefetching less or not prefetching at all? I'm confused.

@stedolan (Contributor, Author) commented Mar 2, 2021

@gadmm I'm still interested in M1 benchmarks - not because of TLB issues, but because I believe the M1 has higher MLP than the machines I've been testing on.

@xavierleroy Yes, this GC sometimes stalls from exceeding the maximum available MLP. This is a good thing.

Fundamentally, marking is a process that does many memory accesses, most of which miss cache. Only a trivial amount of computation is done to each word of memory touched, so the process is limited by memory performance, not CPU. One way or another, the CPU is going to spend some time stalled, waiting for memory.

When trunk stalls, it generally stalls while scanning field i of an object A, which points to object B, whose header is not in cache. The hardware continues to speculate past the stall, and will often predict that field i+1 of A is next to be read, and begin fetching that too. However, this is limited to fields of the same object: the address of the next object to be scanned depends on the result of reading B's header, which the processor cannot know (and does not attempt to guess) until the load of B has returned. So, during a stall, there are at most a couple of active memory loads, to the objects in fields i, i+1 (and maybe beyond, but not far beyond). The memory subsystem is underutilised, because it can handle 10 (or more...) active loads.

When the prefetching GC stalls, it is generally not during loads, which tend to hit in L1 cache (because they were prefetched). However, it is continually issuing prefetches for memory that will be needed later, and sometimes we run into the max MLP and stall. During this kind of stall, there are as many active loads as possible, so this stall only occurs when the memory subsystem is fully utilised.

In other words, stalling due to hitting maximum MLP is the ideal state for a marking loop: it means the memory accesses have been scheduled well enough to fully saturate the memory subsystem, at low enough bookkeeping overhead that the CPU has nothing to do but wait for the result.

Here are some performance counter results for trunk vs. prefetching. These results should be taken with a pinch of salt: first, they're from the markbench.ml program posted above, which is something of an ideal case for prefetching, and second, they don't separate the heap-initialisation phase from the actual GC work, so are a bit noisy.

Trunk:

  L2 accesses (unit: number of cacheline transfers):
       162,451,158      l2_rqsts.demand_data_rd_miss
       597,006,374      l2_rqsts.pf_miss
       138,159,976      l2_lines_out.useless_hwpf


  Stalls (unit: cycles spent while stalled):
    21,937,001,731      cycle_activity.stalls_l2_miss
    29,481,074,997      cycle_activity.stalls_total
    27,324,990,838      cycle_activity.stalls_l1d_miss
    28,639,113,821      cycle_activity.stalls_mem_any
    17,972,511,169      cycle_activity.stalls_l3_miss
        73,101,333      l1d_pend_miss.fb_full
    33,092,457,045      resource_stalls.any

  11.569266782 seconds time elapsed

Prefetching:

  L2 accesses (unit: number of cacheline transfers):
         5,666,881      l2_rqsts.demand_data_rd_miss
       534,525,411      l2_rqsts.pf_miss
         6,667,602      l2_lines_out.useless_hwpf

  Stalls (unit: cycles spent while stalled):
       540,698,446      cycle_activity.stalls_l2_miss
     4,489,759,835      cycle_activity.stalls_total
       969,167,512      cycle_activity.stalls_l1d_miss
     2,351,860,341      cycle_activity.stalls_mem_any
       344,528,248      cycle_activity.stalls_l3_miss
     1,622,623,969      l1d_pend_miss.fb_full
     4,430,692,002      resource_stalls.any

  3.702075640 seconds time elapsed

Stalls due to max MLP come under l1d_pend_miss.fb_full - on this Skylake processor, the limiting factor in MLP is the number of L1 line fill buffers available. Note that these essentially never happen with trunk. Despite that, the prefetching GC spends a lot less time stalled, and in particular has massively fewer stalls_{l1d,l2,l3}_miss - that is, load instructions that stall due to missing cache.

@gadmm (Contributor) commented Mar 2, 2021

Of course @stedolan, this is not what I meant. Thanks again for your explanations.

@xavierleroy (Contributor):

I understand you want to maximize MLP. But I still cannot convince myself this will minimize GC time, which is what we actually care about.

An hypothetical scenario: so many locations are prefetched so much in advance (at the cost of multiple stalls) that by the time an actual load is done, the corresponding prefetched data was already flushed out of cache and has to be fetched again.

Another hypothetical scenario: by issuing so many memory requests, the OCaml process is going to starve other processes executing on the same core (e.g. via hyperthreading), or even on other cores.

A third hypothetical scenario: by issuing so many memory requests, the OCaml process is consuming lots of energy and draining your batteries faster.

In other words: an ideal prefetching policy is one that makes sure all loads hit L1 cache, not one that maximizes MLP.

@stedolan (Contributor, Author) commented Mar 4, 2021

> I understand you want to maximize MLP. But I still cannot convince myself this will minimize GC time, which is what we actually care about.

OK, I'll try going through your scenarios.

> An hypothetical scenario: so many locations are prefetched so much in advance (at the cost of multiple stalls) that by the time an actual load is done, the corresponding prefetched data was already flushed out of cache and has to be fetched again.

This is a real concern, and is one of two factors controlling Pb_size, the size of the prefetch buffer. This controls how far ahead of the GC the prefetching happens. It should be big enough to get a reasonable MLP and hide latency, but also not so big that by the time the actual loads are done the data is flushed out of cache.

On most x86 machines of the last decade, L1 is 32KB, consisting of 512 64-byte lines, and 10-20 active operations are sufficient to get good MLP. So Pb_size should be somewhere between maybe 20 and 500.

In this patch it is currently 256. If you set it to, say, 2^16, then you will definitely see the issue you describe.

> Another hypothetical scenario: by issuing so many memory requests, the OCaml process is going to starve other processes executing on the same core (e.g. via hyperthreading), or even on other cores.

If this mattered, it would already occur in code that naturally achieves high MLP, such as naively summing an array of boxed integers.

> A third hypothetical scenario: by issuing so many memory requests, the OCaml process is consuming lots of energy and draining your batteries faster.

This strategy does not issue any more memory requests than the current GC (assuming, as above, that you haven't set Pb_size to a huge value). It just does a better job of parallelising them.

The power management strategy on most modern processors is "race to idle": since the idle states consume so much less power than the active state, the best strategy is to get the work done as quickly as possible so that you can get back to the idle state. In this model, the most power-efficient approach is whatever's fastest.

> In other words: an ideal prefetching policy is one that makes sure all loads hit L1 cache, not one that maximizes MLP.

I prefer your earlier phrasing: the goal is to minimize GC time.

The marking loop is limited by memory performance, since its bookkeeping overhead is low. It has a fixed number of memory accesses to do, assuming again that Pb_size isn't so large that we do useless ones. The variables we care about are then related by:

GC time = (number of memory accesses) * (average memory latency) / (average memory-level parallelism)

Since we do not control the number of memory accesses needing to be done nor their latency, then minimising GC time is the same as maximising MLP.
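As an illustration with made-up numbers:

```latex
T = \frac{N \cdot L}{\mathrm{MLP}}, \qquad
N = 10^{8},\; L = 300 \text{ cycles}:\quad
\mathrm{MLP} = 1 \Rightarrow T = 3 \times 10^{10} \text{ cycles}, \qquad
\mathrm{MLP} = 10 \Rightarrow T = 3 \times 10^{9} \text{ cycles}.
```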

struct mark_stack stk = *Caml_state->mark_stack;

uintnat young_start = (uintnat)Caml_state->young_start;
uintnat half_young_len = ((uintnat)Caml_state->young_end - (uintnat)Caml_state->young_start) >> 1;
Comment (Member):

Check-typo says this line is too long.

Suggested change: split the line above as

uintnat half_young_len =
  ((uintnat)Caml_state->young_end - (uintnat)Caml_state->young_start) >> 1;

@dra27 (Member) commented Mar 4, 2021

Out of curiosity, I just tested this on msvc64 on a Threadripper 3990X... @stedolan's markbench program built with ocamlopt (averaged over 5 runs, FWIW):

  • On 9f804a2 (I built an old branch by mistake): 2.129s
  • On 5b9789c (this PR's base): 2.368s
  • On 8c3c478 (this PR's present HEAD): 2.709s
  • With #define caml_prefetch(p) _mm_prefetch(p, 3): 0.732s

The caml_prefetch change for MSVC could go in this PR, but I need to work out how to do it properly first (I'm not sure 3 means the same thing here, but it's too late to stare at disassemblies...)

The two slow-downs are worrying.
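For reference, a guarded definition along the lines being discussed might look like this (a sketch, not the actual caml/misc.h code; whether the literal 3 above corresponds to MSVC's _MM_HINT_T0 is exactly the open question, so the symbolic constant is used instead):

```c
#if defined(__GNUC__) || defined(__clang__)
  #define caml_prefetch(p) __builtin_prefetch((p), 0, 3)  /* read, high locality */
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  #include <xmmintrin.h>
  #define caml_prefetch(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
  #define caml_prefetch(p) ((void)(p))                    /* no-op elsewhere */
#endif
```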

@stedolan (Contributor, Author) commented Mar 8, 2021

I can reproduce the difference between the first two even on linux amd64, looking into it now. I'm surprised by the large slowdown to this PR with broken prefetching - there's definitely some overhead for the prefetching bookkeeping if no prefetching is being done, but I'm surprised it's that large. I'll have a look.

@stedolan (Contributor, Author) commented Mar 9, 2021

The difference between 9f804a2 and 5b9789c is explained by #9756 and #9951: #9756 made it go ~45% faster, and then #9951 made it go ~65% slower, leading to a ~20% slowdown overall.

(Don't read too much into the 65% number: this is a microbenchmark. #9951 replaced an optimisation with a different version, but the old version happened to work particularly well on this benchmark).

I'm not sure what's going on between 5b9789c and this PR. One issue is that 5b9789c isn't actually the base commit of this PR, so it's possible that the slowdown was introduced on trunk since 5b9789c. In my testing (limited to x86_64 Linux on Intel), this PR with caml_prefetch disabled has generally outperformed trunk, so I'd like to see if the result persists when testing against 8a90546 or recent trunk.

@dra27 (Member) commented Mar 9, 2021

Today:

  • On 8a90546 (this PR's actual base!): 2.326s
  • On 8c3c478 (this PR's present HEAD): 2.765s

And then with --disable-naked-pointers (sorry about that...):

  • On 9f804a2: 1.806s
  • On 8a90546: 2.054s
  • On 8c3c478: 2.275s
  • With #define caml_prefetch(p) _mm_prefetch(p, 3): 0.633s

@dra27 (Member) commented Mar 9, 2021

I did the same with mingw-w64:

Which hints that the slowdown is down to the C compiler.

@jhjourdan (Contributor) left a comment:

Overall, this PR looks good. I have a few comments, see below.

I am slightly worried about the Forward_tag optimization. I don't buy the argument that most lazy values get forced soon. It depends much on the kind of program. If we decide to get rid of this optimization in the major heap, then we should have strong benchmarks showing that there is no significant performance loss.

One possibility to keep this optimization with the prefetching buffer is to store pointers to values instead of values in the prefetch buffer. This means that we do one more dereferencing per pointer in the heap, but we can hope that this is a cache hit, so this should really be cheap.

(Tag_val(v) >= No_scan_tag || !Is_black_val(v))) {
v = Val_hp(Op_val(v) + Wosize_val(v));
}
me.start = Op_val(v);
@jhjourdan (Contributor) commented Mar 25, 2021:

At that point, it may be the case that v == end + 1 and that end is the end of the chunk. In this case, aren't we dereferencing a pointer outside of the chunk when we are looking for the header of v?

Reply (Member):

If v is one past the end of the chunk, then the header of v is the last word of the chunk. Did I misunderstand your question?

Reply (Contributor):

Sorry, this was misleading. As I explained above, v could be two past the end of the chunk. Then, when we read the header, we read one past the end of the chunk.

Anyway, think about blocks: when we have finished scanning the chunk, then v points to a block which is entirely outside of the chunk, including its header. So, we are effectively reading a header which is outside of the chunk.

Reply (Contributor):

Probably the reason we are not getting a segfault from this is the padding that caml_stat_alloc_aligned adds.

}

p += Whsize_hp(Hp_op(p));
v = Val_hp(me.end);
Comment (Contributor):

If me.end is the end of the chunk, then it seems like this creates a pointer more than one-past the end of the chunk, which is UB. (Well, ok, it is unlikely that the compiler will optimize this in such a way that this triggers a bug).

Reply (@stedolan, Author):

You're right! I'll change this to work on header_t* instead of value.
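A sketch of what that change amounts to (chunk_start and chunk_end are illustrative names): iterating with a header_t * keeps every pointer formed inside, or one past, the chunk.

```c
/* Walk a heap chunk block by block without ever materialising a value
   pointer beyond the end of the chunk. */
header_t *hp = (header_t *)chunk_start;
while (hp < (header_t *)chunk_end) {
  value v = Val_hp(hp);       /* only formed while hp is still inside */
  /* ... examine or mark v ... */
  hp += Whsize_hp(hp);        /* advance by the block's total size in words */
}
```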

while (me.end <= end) {
value v;
if (stk->count < stk->size/4) {
stk->stack[stk->count++] = me;
Comment (Contributor):

There is a slight change in the logic here: the previous code called mark_stack_push while we are pushing manually here.

There are two differences:
1- We do not perform here the "Some(42)" optimization
2- Work accounting is not taken into account

Could you comment on why you decided to drop these two features here?

Reply (@stedolan, Author):

The lack of (1) is an oversight.

For (2), I think the work accounting here was wrong? It looks like trunk takes credit twice for marking the same object if there is a mark stack overflow. (The GC pacing logic relies on there being a fixed amount of work, determined by the shape of the heap at the time marking started).

@jhjourdan (Contributor) commented Mar 29, 2021:

@damiendoligez, can you confirm the previous implementation was wrong wrt. work accounting?

Reply (Member):

Yes, the previous implementation is wrong. It takes credit for the first few words, then pushes the block to the stack. If the stack overflows and the block is evicted, it will take credit again for these few words.

Comment on lines +649 to +698
mark_entry m = stk.stack[--stk.count];
scan = m.start;
obj_end = m.end;
Comment (Contributor):

Here, we will be reading the block while it does not come from the prefetch buffer (and hence it might not be in the cache). It seems like the heuristic here is that we expect blocks at the top of the stack to be in the cache. Is that the idea? Is this something that you made experiments about? Could you please add a comment?

Reply (Member):

I think the idea is that the prefetch buffer is at its minimum size, so its entries are likely to be "not yet fetched". So we'll take a cache hit anyway. Then it makes sense to get something from the stack instead of prematurely emptying the prefetch buffer.

Reply (@stedolan, Author):

Yes, as @damiendoligez says. (And yes, I benchmarked this)

Comment on lines +679 to +738
*Caml_state->mark_stack = stk;
realloc_mark_stack(Caml_state->mark_stack);
stk = *Caml_state->mark_stack;
Comment (Contributor):

Suggested change: replace the three lines above with

realloc_mark_stack(&stk);

Or perhaps you do not want to do that because you want to guarantee that the compiler place stk in registers? If so, then this is worth a comment.

Reply (@stedolan, Author):

Yes, most of the variables in this function are about using registers. I'll add some comments.

Reply (Contributor):

Alright. I suspect that a good compiler could figure out that &stk is not escaping and hence keep this in a register, but this is not robust anyway.

Could you use the same pattern in mark_stack_push, and get rid of the parameter of realloc_mark_stack and mark_stack_prune ?

Reply (@stedolan, Author):

I'm not aware of any compiler that would make such a change without inlining everything.

We certainly could and should use the same pattern in mark_stack_push, but again I'd prefer to leave the ephemeron codepath as is in this patch.

Reply (Contributor):

> I'm not aware of any compiler that would make such a change without inlining everything.

Sure. But C/C++ compilers tend to inline a lot these days.

> We certainly could and should use the same pattern in mark_stack_push, but again I'd prefer to leave the ephemeron codepath as is in this patch.

As you prefer. But this conflict is not going to be so complicated to fix. And if we don't do this change here, we will probably forget it.

Reply (Member):

What about adding a TODO comment?

Comment on lines +413 to 436
/* auxiliary function of mark_ephe_aux */
Caml_inline void mark_ephe_darken(struct mark_stack* stk, value v, mlsize_t i,
int in_ephemeron, int *slice_pointers,
intnat *work)
Comment (Contributor):

At the very least, the in_ephemeron parameter should be removed and the code simplified accordingly, since this function is now only called from the ephemeron logic.

In addition, I wonder whether it would be possible to avoid having two implementations for essentially the same thing. Do you think we could call do_some_marking from mark_ephe_aux?

Reply (@stedolan, Author):

I'm not going to change the ephemeron path in this PR, because there are a number of known bugs being worked on in that code, and I don't want to cause conflicts with #9424. Once both are merged, we can do some cleanup.

work--;

CAML_EVENTLOG_DO({
slice_fields++;
Comment (Contributor):

This is no longer incremented.

Reply (Contributor):

In the new version this is now incremented much more often, maybe this user-facing change (fix) deserves to be logged in the changelog.

@stedolan (Contributor, Author):

Thanks for the review! I'll have a look at this today.

@stedolan (Contributor, Author):

> I am slightly worried about the Forward_tag optimization. I don't buy the argument that most lazy values get forced soon. It depends much on the kind of program. If we decide to get rid of this optimization in the major heap, then we should have strong benchmarks showing that there is no significant performance loss.
>
> One possibility to keep this optimization with the prefetching buffer is to store pointers to values instead of values in the prefetch buffer. This means that we do one more dereferencing per pointer in the heap, but we can hope that this is a cache hit, so this should really be cheap.

I was thinking along similar lines, but in order to decide whether this optimisation is worthwhile we really need some programs to test with. This is where I got stuck: I have yet to find anything that puts more than a couple of thousand Lazy values on the major heap. Do you have anything in mind?

@stedolan (Contributor, Author) commented Aug 3, 2021

I just re-ran some benchmarks, comparing the current version to the original. It turns out that one of the refactorings has introduced a slowdown, causing ~10% more instructions to be executed (and a similar, but more variable, slowdown in time). I've restored the old loop logic (the difference is about which order various tests occur in), as it is measurably faster. It's a bit less readable, but at least it's commented this time.

@stedolan (Contributor, Author) commented Aug 9, 2021

This has now been "approved" for about 3 weeks, so unless someone strenuously objects I'm going to hit "merge" later today.

@gadmm (Contributor) commented Aug 9, 2021

I am finishing up my analysis of the inner loop which I intend to write later at the previously-mentioned link, with a very small improvement/simplification to propose regarding Is_block_and_not_young (I was redoing benchmarks after the last performance change). Here is what I found.

Essentially, instead of a single test using the rotation trick (even if it is cool), I am testing it in two separate conditions Is_block(v) && Is_not_young(v), where Is_not_young performs the fast range check but without the rotation. It goes faster in both synthetic and real-world benchmarks (it is hard to know by how much due to code layout effects, which I have tried to control the best I could following Stephen's advice, but it is not more than 1.5% of the marking time—at best the rotation trick does not improve performance). Your synthetic benchmark shows Is_block_and_not_young going faster but this is because the heap has a very regular pattern and so it does not show the effects of branch mispredictions. I tested OCaml and Coq runs.
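Concretely, the variant being described is roughly the following (a sketch: Is_block is the stock runtime macro, while is_major_candidate and its boundary handling are illustrative):

```c
#include "caml/mlvalues.h"

/* Two cheap, separately predicted tests instead of one combined rotation:
   Is_block, then a single unsigned comparison against the minor-heap range
   [young_start, young_start + young_len). */
static int is_major_candidate(value v, uintnat young_start, uintnat young_len)
{
  return Is_block(v) && ((uintnat)v - young_start >= young_len);
}
```

(In the marking loop, young_start and young_len would be hoisted into locals outside the loop, as the PR already does.)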

I understand the result as follows: in real programs, Is_block_and_not_young and Is_block will cause a lot of branch mispredictions (they are false only 15-40% of the time in the real programs I tested, with very poor predictability). To backtrack from a branch misprediction, the faster the condition is computed the better, so Is_block performs better (2 cycles less of latency). On the other hand Is_not_young is by far almost always true, and thus it is free thanks to branch prediction and instruction-level parallelism. (Another observed situation with Coq is short slices with 60-80% of immediates and good prediction, in which case Is_block should also win. I have asked Coq developers whether they weren't needlessly scanning large arrays of immediates somewhere.)

The improvement is small but in any case it simplifies the code.

The fact that a separate Is_not_young check is free thanks to instruction-level parallelism also explains why it is possible to replace it with an optimised page table check that skips the marking of statics, without suffering from a performance penalty, as I found, contrary to initial claims.

Thank you to Stephen for his help (and patience) in private conversation during the review!

@lpw25 (Contributor) commented Aug 10, 2021

My understanding is that @gadmm's comment is about a potential further improvement. That can be done in a later PR, so I'm going to click merge. Feel free to keep discussing the improvement here though.

@lpw25 dismissed jhjourdan's stale review on August 10, 2021 12:51:

My understanding is that this has been addressed

@lpw25 merged commit 1dc70d3 into ocaml:trunk on Aug 10, 2021
stedolan added a commit to stedolan/ocaml that referenced this pull request Aug 10, 2021
Speed up GC by prefetching during marking

(cherry picked from commit 1dc70d3)
@gadmm (Contributor) commented Aug 10, 2021

Sure @lpw25 (but it creates more work to open a PR just for this). Let me know @stedolan if you are interested in the improvement/simplification.

For the curious here are some benchmarks. (I offer it without interpretation. @stedolan asked for benchmarks on interestingly-large programs, but for an interpretation it is better to do it with all the details elsewhere.)

  • Before: OCaml 4.12 plus the fix to work accounting
  • After: this PR together with the patch to prefetch during sweeping, rebased on top of 4.12
  • PT: with page table
  • NNP: with --disable-naked-pointers
  • After, PT (1): what is now in OCaml 4.14
  • After, PT (2): the prefetching strategy for the page table that I suggested, which I also tested and which I can propose for review if you let me know you are interested after all.

Things measured:

  • pace (ns/w): time spent inside the marking loop to perform one unit of work (i.e. decide that one word is still live this major cycle).
  • total time (mm:ss): wall-clock time of opam running; this measurement also benefits from the patch to prefetch during sweeping (take into account that there is a lot of fluctuation with this measure)
  • latency (ms): contribution of the marking loop in a mark slice to the tail latency (also fluctuates)

The test is to install some ocaml or coq packages with opam, which then runs OCaml (in the first case) or Coq (in the second case). This is with 2 physical cores and 2 jobs; the gains with all logical cores busy are likely to be lesser (since hyperthreading helps keep the physical cores busy during stalls), which would be another interesting benchmark to run.

OCaml, compiling OCaml, dune, lablgtk3 and a couple of smaller packages

| Throughput | pace (ns/w) | rel. (%) | total time | rel. (%) |
| --- | --- | --- | --- | --- |
| Before, PT | 18.6 | | 4:09 | |
| Before, NNP | 18.0 | -3% | 4:08 | -0.4% |
| After, PT (1) | 5.5 | -70% | 3:12 | -23% |
| After, PT (2) | 4.9 | -74% | 3:11 | -23% |
| After, NNP | 4.2 | -78% | 3:09 | -24% |

Latency is not an interesting measure for OCaml itself, and the following table aggregates the measures of different programs so the absolute value is not meaningful in itself (for instance ocamlc gives much longer mark slices than ocamlopt even though both have the default GC settings); however the relative improvement is consistent across programs and is going to be interesting.

| Latency | max ≤99.99% (ms) | rel. (%) |
| --- | --- | --- |
| Before, PT | 219 | |
| Before, NNP | 204 | -7% |
| After, PT (1) | 55 | -75% |
| After, PT (2) | 51 | -77% |
| After, NNP | 42 | -81% |

Coq, compiling some heavy Mathcomp libraries

(mathcomp-character and mathcomp-odd-order)

| Throughput | pace (ns/w) | rel. (%) | total time | rel. (%) |
| --- | --- | --- | --- | --- |
| Before, PT | 19.6 | | 13:41 | |
| Before, NNP | 18.8 | -4% | 13:30 | -1.3% |
| After, PT (1) | 7.1 | -64% | 12:55 | -5.6% |
| After, PT (2) | 6.0 | -70% | 12:52 | -6.0% |
| After, NNP | 5.0 | -75% | 12:41 | -7.3% |

Coq's GC settings are optimised for throughput (with a minor heap of 256MB and a space overhead of 200%), so a lesser share of the time is spent in the major GC compared to OCaml. Latency is not an interesting measure for Coq, but the gain is as above.

(edit: clarify that the total time also measures the improvement of the patch that introduces prefetching during sweeping)

poechsel pushed a commit to ocaml-flambda/ocaml that referenced this pull request Sep 3, 2021
Speed up GC by prefetching during marking

(cherry picked from commit 1dc70d3)
chambart pushed a commit to chambart/ocaml-1 that referenced this pull request Sep 9, 2021
Speed up GC by prefetching during marking

(cherry picked from commit 1dc70d3)
stedolan added a commit to stedolan/ocaml that referenced this pull request Oct 5, 2021
Speed up GC by prefetching during marking

(cherry picked from commit 1dc70d3)
stedolan added a commit to stedolan/ocaml that referenced this pull request Dec 13, 2021