[X86] Reduce znver3/4 LoopMicroOpBufferSize to practical loop unrolling values (#91340)

The znver3/4 scheduler models previously tied LoopMicroOpBufferSize to the maximum size of their op caches. When this led to quadratic complexity issues, the value was reduced to 512 uops, a choice based mainly on compilation time rather than on its effect on runtime performance.

From a runtime performance point of view, a large LoopMicroOpBufferSize leads to more loop unrolling, meaning the CPU has to rely on the frontend decode rate (4 ins/cy max) for much longer to fill the op cache before looping begins and we can make use of the faster op cache rate (8/9 ops/cy).
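As a rough illustration of the fill cost described above, the following C++ sketch compares how many cycles it takes to deliver an unrolled loop body through the decoder versus the op cache. The helper `cyclesToDeliver` is hypothetical and not part of LLVM. The sketch assumes peak rates are sustained and treats one instruction as one op, which real code will not match exactly.

```cpp
// Hypothetical helper (not an LLVM API): cycles needed to deliver `ops`
// at a sustained peak rate of `opsPerCycle`, rounded up.
constexpr unsigned cyclesToDeliver(unsigned ops, unsigned opsPerCycle) {
  return (ops + opsPerCycle - 1) / opsPerCycle;
}

// Rates quoted in the commit message (znver3 shown for the op cache).
constexpr unsigned DecodeRate = 4;   // frontend decode, ins/cy max
constexpr unsigned OpCacheRate = 8;  // op cache delivery, ops/cy

// Assumption: one instruction maps to one op.
// First pass through a 512-op unrolled body is decode-bound.
constexpr unsigned FirstPass512 = cyclesToDeliver(512, DecodeRate);
// Later iterations of the same body can stream from the op cache.
constexpr unsigned LaterPass512 = cyclesToDeliver(512, OpCacheRate);
// The same comparison for a 96-op body.
constexpr unsigned FirstPass96 = cyclesToDeliver(96, DecodeRate);
constexpr unsigned LaterPass96 = cyclesToDeliver(96, OpCacheRate);
```

Under these assumptions the first pass over a 512-op body costs 128 cycles at the decode rate against 64 from the op cache, while a 96-op body costs 24 against 12, so a smaller unrolled body reaches the faster steady state sooner.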

This patch proposes we instead cap LoopMicroOpBufferSize based on the maximum rate from the op cache (znver3 = 8 ops/cy, znver4 = 9 ops/cy) and the branch misprediction penalty from the op cache (~12cy), as an estimate of the useful number of ops we can unroll a loop by before mispredictions are likely to cause stalls. This isn't a perfect metric, but it tries to be closer to the spirit of how we use LoopMicroOpBufferSize in the compiler than the size of a similarly named buffer in the CPU.
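The arithmetic behind the proposed caps can be written out as a short C++ sketch. The function `loopBufferCap` and the constant names are hypothetical, introduced only for illustration. The values the patch actually sets live in the TableGen scheduler models, and the 12-cycle penalty is the approximate figure quoted above.

```cpp
// Hypothetical helper (not an LLVM API): the number of ops the op cache can
// deliver during one branch-mispredict penalty window.
constexpr unsigned loopBufferCap(unsigned opCacheOpsPerCycle,
                                 unsigned mispredictPenaltyCycles) {
  return opCacheOpsPerCycle * mispredictPenaltyCycles;
}

// Approximate mispredict cost from the op cache, per the commit message.
constexpr unsigned MispredictPenaltyCycles = 12;

// znver3: 8 ops/cy from the op cache.
constexpr unsigned Znver3LoopBufferCap =
    loopBufferCap(8, MispredictPenaltyCycles);
// znver4: 9 ops/cy from the op cache.
constexpr unsigned Znver4LoopBufferCap =
    loopBufferCap(9, MispredictPenaltyCycles);

static_assert(Znver3LoopBufferCap == 96, "matches the znver3 value");
static_assert(Znver4LoopBufferCap == 108, "matches the znver4 value");
```

Both results match the `LoopMicroOpBufferSize` values in the diffs for `X86ScheduleZnver3.td` and `X86ScheduleZnver4.td`.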
RKSimon committed May 16, 2024
1 parent 4a5dffc commit 54e52aa
Showing 3 changed files with 44 additions and 965 deletions.
11 changes: 4 additions & 7 deletions llvm/lib/Target/X86/X86ScheduleZnver3.td
@@ -33,13 +33,10 @@ def Znver3Model : SchedMachineModel {
  // The op cache is organized as an associative cache with 64 sets and 8 ways.
  // At each set-way intersection is an entry containing up to 8 macro ops.
  // The maximum capacity of the op cache is 4K ops.
- // Agner, 22.5 µop cache
- // The size of the µop cache is big enough for holding most critical loops.
- // FIXME: PR50584: MachineScheduler/PostRAScheduler have quadradic complexity,
- // with large values here the compilation of certain loops
- // ends up taking way too long.
- // let LoopMicroOpBufferSize = 4096;
- let LoopMicroOpBufferSize = 512;
+ // Assuming a maximum dispatch of 8 ops/cy and a mispredict cost of 12cy from
+ // the op-cache, we limit the loop buffer to 8*12 = 96 to avoid loop unrolling
+ // leading to excessive filling of the op-cache from frontend.
+ let LoopMicroOpBufferSize = 96;
  // AMD SOG 19h, 2.6.2 L1 Data Cache
  // The L1 data cache has a 4- or 5- cycle integer load-to-use latency.
  // AMD SOG 19h, 2.12 L1 Data Cache
16 changes: 5 additions & 11 deletions llvm/lib/Target/X86/X86ScheduleZnver4.td
@@ -28,17 +28,11 @@ def Znver4Model : SchedMachineModel {
  // AMD SOG 19h, 2.9.1 Op Cache
  // The op cache is organized as an associative cache with 64 sets and 8 ways.
  // At each set-way intersection is an entry containing up to 8 macro ops.
- // The maximum capacity of the op cache is 4K ops.
- // Agner, 22.5 µop cache
- // The size of the µop cache is big enough for holding most critical loops.
- // FIXME: PR50584: MachineScheduler/PostRAScheduler have quadradic complexity,
- // with large values here the compilation of certain loops
- // ends up taking way too long.
- // Ideally for znver4, we should have 6.75K. However we don't add that
- // considerting the impact compile time and prefer using default values
- // instead.
- // Retaining minimal value to influence unrolling as we did for znver3.
- let LoopMicroOpBufferSize = 512;
+ // The maximum capacity of the op cache is 6.75K ops.
+ // Assuming a maximum dispatch of 9 ops/cy and a mispredict cost of 12cy from
+ // the op-cache, we limit the loop buffer to 9*12 = 108 to avoid loop
+ // unrolling leading to excessive filling of the op-cache from frontend.
+ let LoopMicroOpBufferSize = 108;
  // AMD SOG 19h, 2.6.2 L1 Data Cache
  // The L1 data cache has a 4- or 5- cycle integer load-to-use latency.
  // AMD SOG 19h, 2.12 L1 Data Cache