[llvm-mca] Add fields "Total uOps" and "uOps Per Cycle" to the report generated by the SummaryView.

This patch adds two new fields to the perf report generated by the SummaryView.
Fields are now logically organized into two small groups; only the second group
contains throughput indicators.

Example:
```
Iterations:        100
Instructions:      300
Total Cycles:      414
Total uOps:        700

Dispatch Width:    4
uOps Per Cycle:    1.69
IPC:               0.72
Block RThroughput: 4.0
```

This patch also updates the docs for llvm-mca.
Due to the nature of this change, several tests in the tools/llvm-mca directory
were affected, and had to be updated using script `update_mca_test_checks.py`.

llvm-svn: 340946
llvm · Aug 29, 2018 · a2eee47
1 parent 5221e17

Showing 78 changed files with 554 additions and 199 deletions.
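
The two throughput fields in the commit message example follow directly from the totals in the first group of the report. The snippet below is a minimal sketch of that arithmetic, not the SummaryView implementation; the function name `summary_throughput` is hypothetical, and rounding to two decimals is an assumption based on how the report prints its values.

```python
def summary_throughput(instructions, uops, cycles):
    """Return (uOps Per Cycle, IPC), each rounded to two decimals.

    Hypothetical helper: it mirrors the definitions in the docs
    (total uOps / total cycles, total instructions / total cycles).
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return round(uops / cycles, 2), round(instructions / cycles, 2)


# Totals taken from the example report in the commit message.
uops_per_cycle, ipc = summary_throughput(instructions=300, uops=700, cycles=414)
print(f"uOps Per Cycle:    {uops_per_cycle:.2f}")
print(f"IPC:               {ipc:.2f}")
```

With 700 uOps and 300 instructions over 414 cycles, this gives 1.69 and 0.72, matching the example report. The two values coincide only when every instruction decodes to a single uOp, which is why many of the updated tests show identical numbers in both fields.
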
diff --git a/llvm/docs/CommandGuide/llvm-mca.rst b/llvm/docs/CommandGuide/llvm-mca.rst
@@ -238,7 +238,10 @@ the following command using the example located at
   Iterations:        300
   Instructions:      900
   Total Cycles:      610
+  Total uOps:        900
+
   Dispatch Width:    2
+  uOps Per Cycle:    1.48
   IPC:               1.48
   Block RThroughput: 2.0
 
@@ -285,35 +288,45 @@ the following command using the example located at
    -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps	%xmm3, %xmm3, %xmm4
 
 According to this report, the dot-product kernel has been executed 300 times,
-for a total of 900 dynamically executed instructions.
+for a total of 900 simulated instructions. The total number of simulated micro
+opcodes (uOps) is also 900.
 
 The report is structured in three main sections.  The first section collects a
 few performance numbers; the goal of this section is to give a very quick
-overview of the performance throughput. In this example, the two important
-performance indicators are **IPC** and **Block RThroughput** (Block Reciprocal
+overview of the performance throughput. Important performance indicators are
+**IPC**, **uOps Per Cycle**, and  **Block RThroughput** (Block Reciprocal
 Throughput).
 
 IPC is computed dividing the total number of simulated instructions by the total
-number of cycles.  A delta between Dispatch Width and IPC is an indicator of a
-performance issue. In the absence of loop-carried data dependencies, the
+number of cycles. In the absence of loop-carried data dependencies, the
 observed IPC tends to a theoretical maximum which can be computed by dividing
 the number of instructions of a single iteration by the *Block RThroughput*.
 
-IPC is bounded from above by the dispatch width. That is because the dispatch
-width limits the maximum size of a dispatch group. IPC is also limited by the
-amount of hardware parallelism. The availability of hardware resources affects
-the resource pressure distribution, and it limits the number of instructions
-that can be executed in parallel every cycle.  A delta between Dispatch
-Width and the theoretical maximum IPC is an indicator of a performance
-bottleneck caused by the lack of hardware resources. In general, the lower the
-Block RThroughput, the better.
-
-In this example, ``Instructions per iteration/Block RThroughput`` is 1.50. Since
-there are no loop-carried dependencies, the observed IPC is expected to approach
-1.50 when the number of iterations tends to infinity. The delta between the
-Dispatch Width (2.00), and the theoretical maximum IPC (1.50) is an indicator of
-a performance bottleneck caused by the lack of hardware resources, and the
-*Resource pressure view* can help to identify the problematic resource usage.
+Field 'uOps Per Cycle' is computed by dividing the total number of simulated micro
+opcodes by the total number of cycles. A delta between Dispatch Width and this
+field is an indicator of a performance issue. In the absence of loop-carried
+data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical
+maximum throughput which can be computed by dividing the number of uOps of a
+single iteration by the *Block RThroughput*.
+
+Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
+because the dispatch width limits the maximum size of a dispatch group. Both IPC
+and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
+availability of hardware resources affects the resource pressure distribution,
+and it limits the number of instructions that can be executed in parallel every
+cycle.  A delta between Dispatch Width and the theoretical maximum uOps per
+Cycle (computed by dividing the number of uOps of a single iteration by the
+*Block RThroughput*) is an indicator of a performance bottleneck caused by the
+lack of hardware resources.
+In general, the lower the Block RThroughput, the better.
+
+In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
+are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
+approach 1.50 when the number of iterations tends to infinity. The delta between
+the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
+an indicator of a performance bottleneck caused by the lack of hardware
+resources, and the *Resource pressure view* can help to identify the problematic
+resource usage.
 
 The second section of the report shows the latency and reciprocal
 throughput of every instruction in the sequence. That section also reports
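
The updated documentation compares three quantities for the dot-product example: the Dispatch Width, the theoretical maximum uOp throughput, and the observed *uOps Per Cycle*. The sketch below works through that comparison using the numbers from the report above. The helper name `max_uops_per_cycle` is hypothetical, and the calculation assumes no loop-carried dependencies, as the docs state for this kernel.

```python
def max_uops_per_cycle(uops_per_iteration, block_rthroughput):
    """Theoretical maximum uOp throughput for one block.

    Hypothetical helper following the docs: uOps of a single
    iteration divided by the Block RThroughput.
    """
    if block_rthroughput <= 0:
        raise ValueError("block_rthroughput must be positive")
    return uops_per_iteration / block_rthroughput


# Values from the dot-product report in llvm-mca.rst.
iterations, total_uops, total_cycles = 300, 900, 610
dispatch_width, block_rthroughput = 2, 2.0

limit = max_uops_per_cycle(total_uops / iterations, block_rthroughput)
observed = total_uops / total_cycles

print(f"theoretical maximum: {limit:.2f}")
print(f"observed:            {observed:.2f}")
print(f"delta to dispatch:   {dispatch_width - limit:.2f}")
```

Here the limit is 1.50 uOps per cycle, the observed value is about 1.48, and the remaining 0.50 gap to the Dispatch Width is the part the docs attribute to a lack of hardware resources.
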

diff --git a/llvm/test/tools/llvm-mca/AArch64/CortexA57/direct-branch.s b/llvm/test/tools/llvm-mca/AArch64/CortexA57/direct-branch.s
@@ -6,7 +6,10 @@
 # CHECK:      Iterations:        600
 # CHECK-NEXT: Instructions:      600
 # CHECK-NEXT: Total Cycles:      603
-# CHECK-NEXT: Dispatch Width:    3
+# CHECK-NEXT: Total uOps:        600
+
+# CHECK:      Dispatch Width:    3
+# CHECK-NEXT: uOps Per Cycle:    1.00
 # CHECK-NEXT: IPC:               1.00
 # CHECK-NEXT: Block RThroughput: 1.0
 

diff --git a/llvm/test/tools/llvm-mca/AArch64/Exynos/direct-branch.s b/llvm/test/tools/llvm-mca/AArch64/Exynos/direct-branch.s
@@ -8,12 +8,17 @@
 # ALL-NEXT: Instructions:      300
 
 # M1-NEXT:  Total Cycles:      76
-# M1-NEXT:  Dispatch Width:    4
+# M3-NEXT:  Total Cycles:      51
+
+# ALL-NEXT: Total uOps:        300
+
+# M1:       Dispatch Width:    4
+# M1-NEXT:  uOps Per Cycle:    3.95
 # M1-NEXT:  IPC:               3.95
 # M1-NEXT:  Block RThroughput: 0.3
 
-# M3-NEXT:  Total Cycles:      51
-# M3-NEXT:  Dispatch Width:    6
+# M3:       Dispatch Width:    6
+# M3-NEXT:  uOps Per Cycle:    5.88
 # M3-NEXT:  IPC:               5.88
 # M3-NEXT:  Block RThroughput: 0.2
 

diff --git a/llvm/test/tools/llvm-mca/AArch64/Exynos/pr38575.s b/llvm/test/tools/llvm-mca/AArch64/Exynos/pr38575.s
@@ -14,7 +14,10 @@ ror x1, x2, x3
 # CHECK:      Iterations:        100
 # CHECK-NEXT: Instructions:      100
 # CHECK-NEXT: Total Cycles:      28
-# CHECK-NEXT: Dispatch Width:    6
+# CHECK-NEXT: Total uOps:        100
+
+# CHECK:      Dispatch Width:    6
+# CHECK-NEXT: uOps Per Cycle:    3.57
 # CHECK-NEXT: IPC:               3.57
 # CHECK-NEXT: Block RThroughput: 0.3
 

diff --git a/llvm/test/tools/llvm-mca/AArch64/Exynos/scheduler-queue-usage.s b/llvm/test/tools/llvm-mca/AArch64/Exynos/scheduler-queue-usage.s
@@ -7,13 +7,16 @@
 # ALL:      Iterations:        1
 # ALL-NEXT: Instructions:      1
 # ALL-NEXT: Total Cycles:      2
+# ALL-NEXT: Total uOps:        1
 
-# M1-NEXT:  Dispatch Width:    4
-# M3-NEXT:  Dispatch Width:    6
-
-# ALL-NEXT: IPC:               0.50
-
+# M1:       Dispatch Width:    4
+# M1-NEXT:  uOps Per Cycle:    0.50
+# M1-NEXT:  IPC:               0.50
 # M1-NEXT:  Block RThroughput: 0.3
+
+# M3:       Dispatch Width:    6
+# M3-NEXT:  uOps Per Cycle:    0.50
+# M3-NEXT:  IPC:               0.50
 # M3-NEXT:  Block RThroughput: 0.2
 
 # ALL:      Schedulers - number of cycles where we saw N instructions issued:

diff --git a/llvm/test/tools/llvm-mca/AArch64/Falkor/zero-latency-store.s b/llvm/test/tools/llvm-mca/AArch64/Falkor/zero-latency-store.s
@@ -6,7 +6,10 @@
 # CHECK:      Iterations:        2
 # CHECK-NEXT: Instructions:      2
 # CHECK-NEXT: Total Cycles:      4
-# CHECK-NEXT: Dispatch Width:    8
+# CHECK-NEXT: Total uOps:        4
+
+# CHECK:      Dispatch Width:    8
+# CHECK-NEXT: uOps Per Cycle:    1.00
 # CHECK-NEXT: IPC:               0.50
 # CHECK-NEXT: Block RThroughput: 1.0
 

diff --git a/llvm/test/tools/llvm-mca/ARM/simple-test-cortex-a9.s b/llvm/test/tools/llvm-mca/ARM/simple-test-cortex-a9.s
@@ -6,7 +6,10 @@ vadd.f32 s0, s2, s2
 # CHECK:      Iterations:        100
 # CHECK-NEXT: Instructions:      100
 # CHECK-NEXT: Total Cycles:      105
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        100
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    0.95
 # CHECK-NEXT: IPC:               0.95
 # CHECK-NEXT: Block RThroughput: 1.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/add-sequence.s b/llvm/test/tools/llvm-mca/X86/BtVer2/add-sequence.s
@@ -8,7 +8,10 @@ add %eax, %edx
 # CHECK:      Iterations:        1000
 # CHECK-NEXT: Instructions:      3000
 # CHECK-NEXT: Total Cycles:      1506
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        3000
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    1.99
 # CHECK-NEXT: IPC:               1.99
 # CHECK-NEXT: Block RThroughput: 1.5
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/clear-super-register-1.s b/llvm/test/tools/llvm-mca/X86/BtVer2/clear-super-register-1.s
@@ -16,7 +16,10 @@ bsf   %rax, %rcx
 # CHECK:      Iterations:        100
 # CHECK-NEXT: Instructions:      400
 # CHECK-NEXT: Total Cycles:      704
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        1200
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    1.70
 # CHECK-NEXT: IPC:               0.57
 # CHECK-NEXT: Block RThroughput: 6.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/clear-super-register-2.s b/llvm/test/tools/llvm-mca/X86/BtVer2/clear-super-register-2.s
@@ -34,7 +34,10 @@ vandps %xmm4, %xmm1, %xmm0
 # CHECK:      Iterations:        100
 # CHECK-NEXT: Instructions:      1800
 # CHECK-NEXT: Total Cycles:      3811
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        3400
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    0.89
 # CHECK-NEXT: IPC:               0.47
 # CHECK-NEXT: Block RThroughput: 38.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/dependency-breaking-cmp.s b/llvm/test/tools/llvm-mca/X86/BtVer2/dependency-breaking-cmp.s
@@ -12,7 +12,10 @@ cmovae %ebx, %eax
 # CHECK:      Iterations:        1500
 # CHECK-NEXT: Instructions:      3000
 # CHECK-NEXT: Total Cycles:      1504
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        3000
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    1.99
 # CHECK-NEXT: IPC:               1.99
 # CHECK-NEXT: Block RThroughput: 1.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/dependency-breaking-pcmpeq.s b/llvm/test/tools/llvm-mca/X86/BtVer2/dependency-breaking-pcmpeq.s
@@ -15,7 +15,10 @@ vpcmpeqq %xmm3, %xmm3, %xmm0
 # CHECK:      Iterations:        1500
 # CHECK-NEXT: Instructions:      6000
 # CHECK-NEXT: Total Cycles:      3003
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        6000
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    2.00
 # CHECK-NEXT: IPC:               2.00
 # CHECK-NEXT: Block RThroughput: 2.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/dependency-breaking-pcmpgt.s b/llvm/test/tools/llvm-mca/X86/BtVer2/dependency-breaking-pcmpgt.s
@@ -16,7 +16,10 @@ vpcmpgtq %xmm3, %xmm3, %xmm0
 # CHECK:      Iterations:        1500
 # CHECK-NEXT: Instructions:      6000
 # CHECK-NEXT: Total Cycles:      3001
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        6000
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    2.00
 # CHECK-NEXT: IPC:               2.00
 # CHECK-NEXT: Block RThroughput: 2.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/dependency-breaking-sbb-1.s b/llvm/test/tools/llvm-mca/X86/BtVer2/dependency-breaking-sbb-1.s
@@ -13,7 +13,10 @@ sbb %eax, %eax
 # CHECK:      Iterations:        1500
 # CHECK-NEXT: Instructions:      3000
 # CHECK-NEXT: Total Cycles:      3003
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        3000
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    1.00
 # CHECK-NEXT: IPC:               1.00
 # CHECK-NEXT: Block RThroughput: 2.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/dependency-breaking-sbb-2.s b/llvm/test/tools/llvm-mca/X86/BtVer2/dependency-breaking-sbb-2.s
@@ -14,7 +14,10 @@ sbb %eax, %eax
 # CHECK:      Iterations:        1500
 # CHECK-NEXT: Instructions:      4500
 # CHECK-NEXT: Total Cycles:      3007
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        6000
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    2.00
 # CHECK-NEXT: IPC:               1.50
 # CHECK-NEXT: Block RThroughput: 2.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/dependent-pmuld-paddd.s b/llvm/test/tools/llvm-mca/X86/BtVer2/dependent-pmuld-paddd.s
@@ -8,7 +8,10 @@ vpaddd %xmm0, %xmm0, %xmm3
 # CHECK:      Iterations:        500
 # CHECK-NEXT: Instructions:      1500
 # CHECK-NEXT: Total Cycles:      1504
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        1500
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    1.00
 # CHECK-NEXT: IPC:               1.00
 # CHECK-NEXT: Block RThroughput: 1.5
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/dot-product.s b/llvm/test/tools/llvm-mca/X86/BtVer2/dot-product.s
@@ -8,7 +8,10 @@ vhaddps  %xmm3, %xmm3, %xmm4
 # CHECK:      Iterations:        300
 # CHECK-NEXT: Instructions:      900
 # CHECK-NEXT: Total Cycles:      610
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        900
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    1.48
 # CHECK-NEXT: IPC:               1.48
 # CHECK-NEXT: Block RThroughput: 2.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/hadd-read-after-ld-1.s b/llvm/test/tools/llvm-mca/X86/BtVer2/hadd-read-after-ld-1.s
@@ -7,7 +7,10 @@ vhaddps (%rdi), %xmm1, %xmm2
 # CHECK:      Iterations:        1
 # CHECK-NEXT: Instructions:      2
 # CHECK-NEXT: Total Cycles:      11
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        2
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    0.18
 # CHECK-NEXT: IPC:               0.18
 # CHECK-NEXT: Block RThroughput: 1.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/hadd-read-after-ld-2.s b/llvm/test/tools/llvm-mca/X86/BtVer2/hadd-read-after-ld-2.s
@@ -7,7 +7,10 @@ vhaddps (%rdi), %ymm1, %ymm2
 # CHECK:      Iterations:        1
 # CHECK-NEXT: Instructions:      2
 # CHECK-NEXT: Total Cycles:      12
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        3
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    0.25
 # CHECK-NEXT: IPC:               0.17
 # CHECK-NEXT: Block RThroughput: 2.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/instruction-info-view.s b/llvm/test/tools/llvm-mca/X86/BtVer2/instruction-info-view.s
@@ -14,7 +14,11 @@ vhaddps  %xmm3, %xmm3, %xmm4
 # ENABLED:       Iterations:        100
 # ENABLED-NEXT:  Instructions:      300
 # ENABLED-NEXT:  Total Cycles:      209
-# ENABLED-NEXT:  Dispatch Width:    2
+# ENABLED-NEXT:  Total uOps:        300
+
+
+# ENABLED:       Dispatch Width:    2
+# ENABLED-NEXT:  uOps Per Cycle:    1.44
 # ENABLED-NEXT:  IPC:               1.44
 # ENABLED-NEXT:  Block RThroughput: 2.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/load-store-alias.s b/llvm/test/tools/llvm-mca/X86/BtVer2/load-store-alias.s
@@ -13,7 +13,10 @@ vmovaps %xmm0, 48(%rdi)
 # CHECK:      Iterations:        100
 # CHECK-NEXT: Instructions:      800
 # CHECK-NEXT: Total Cycles:      2403
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        800
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    0.33
 # CHECK-NEXT: IPC:               0.33
 # CHECK-NEXT: Block RThroughput: 4.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/memcpy-like-test.s b/llvm/test/tools/llvm-mca/X86/BtVer2/memcpy-like-test.s
@@ -13,7 +13,10 @@ vmovaps %xmm0, 48(%rdi)
 # CHECK:      Iterations:        100
 # CHECK-NEXT: Instructions:      800
 # CHECK-NEXT: Total Cycles:      408
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        800
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    1.96
 # CHECK-NEXT: IPC:               1.96
 # CHECK-NEXT: Block RThroughput: 4.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/one-idioms.s b/llvm/test/tools/llvm-mca/X86/BtVer2/one-idioms.s
@@ -30,7 +30,10 @@ vpcmpeqw  %xmm3, %xmm3, %xmm5
 # CHECK:      Iterations:        100
 # CHECK-NEXT: Instructions:      1500
 # CHECK-NEXT: Total Cycles:      753
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        1500
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    1.99
 # CHECK-NEXT: IPC:               1.99
 # CHECK-NEXT: Block RThroughput: 7.5
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/partial-reg-update-2.s b/llvm/test/tools/llvm-mca/X86/BtVer2/partial-reg-update-2.s
@@ -8,7 +8,10 @@ add    %ecx, %ebx
 # CHECK:      Iterations:        1
 # CHECK-NEXT: Instructions:      3
 # CHECK-NEXT: Total Cycles:      11
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        4
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    0.36
 # CHECK-NEXT: IPC:               0.27
 # CHECK-NEXT: Block RThroughput: 4.0
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/partial-reg-update-3.s b/llvm/test/tools/llvm-mca/X86/BtVer2/partial-reg-update-3.s
@@ -13,7 +13,10 @@ xor %bx, %dx
 # CHECK:      Iterations:        1500
 # CHECK-NEXT: Instructions:      4500
 # CHECK-NEXT: Total Cycles:      4503
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        4500
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    1.00
 # CHECK-NEXT: IPC:               1.00
 # CHECK-NEXT: Block RThroughput: 1.5
 

diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/partial-reg-update-4.s b/llvm/test/tools/llvm-mca/X86/BtVer2/partial-reg-update-4.s
@@ -13,7 +13,10 @@ add %cx, %bx
 # CHECK:      Iterations:        1500
 # CHECK-NEXT: Instructions:      4500
 # CHECK-NEXT: Total Cycles:      7503
-# CHECK-NEXT: Dispatch Width:    2
+# CHECK-NEXT: Total uOps:        6000
+
+# CHECK:      Dispatch Width:    2
+# CHECK-NEXT: uOps Per Cycle:    0.80
 # CHECK-NEXT: IPC:               0.60
 # CHECK-NEXT: Block RThroughput: 2.0