[llvm-mca] Add fields "Total uOps" and "uOps Per Cycle" to the report generated by the SummaryView.

This patch adds two new fields to the perf report generated by the SummaryView.
Fields are now logically organized into two small groups; only the second group
contains throughput indicators.

Example:
```
Iterations:        100
Instructions:      300
Total Cycles:      414
Total uOps:        700

Dispatch Width:    4
uOps Per Cycle:    1.69
IPC:               0.72
Block RThroughput: 4.0
```

This patch also updates the docs for llvm-mca.
Due to the nature of this change, several tests in the tools/llvm-mca directory
were affected, and had to be updated using script `update_mca_test_checks.py`.
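
As a sanity check on the example report above, the two throughput fields can be recomputed from the raw counters. The sketch below is illustrative only: the helper name `summary_throughput` is hypothetical and is not part of llvm-mca, and it assumes the report prints both values with two decimal places.

```python
from typing import Tuple


def summary_throughput(instructions: int, total_uops: int,
                       total_cycles: int) -> Tuple[float, float]:
    """Return (uOps per cycle, IPC) for one simulated run.

    Hypothetical helper, not an llvm-mca API: both values are plain
    ratios over the total number of simulated cycles.
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return total_uops / total_cycles, instructions / total_cycles


# Counters taken from the example report in this commit message.
uops_per_cycle, ipc = summary_throughput(
    instructions=300, total_uops=700, total_cycles=414)

print(f"uOps Per Cycle:    {uops_per_cycle:.2f}")
print(f"IPC:               {ipc:.2f}")
```

With the counters from the example, the two printed values agree with the `uOps Per Cycle` and `IPC` lines of the report.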

llvm-svn: 340946
Andrea Di Biagio authored and Andrea Di Biagio committed Aug 29, 2018
1 parent 5221e17 commit a2eee47
Showing 78 changed files with 554 additions and 199 deletions.
53 changes: 33 additions & 20 deletions llvm/docs/CommandGuide/llvm-mca.rst
@@ -238,7 +238,10 @@ the following command using the example located at
Iterations: 300
Instructions: 900
Total Cycles: 610
Total uOps: 900
Dispatch Width: 2
uOps Per Cycle: 1.48
IPC: 1.48
Block RThroughput: 2.0
@@ -285,35 +288,45 @@ the following command using the example located at
- - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm3, %xmm3, %xmm4
According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 dynamically executed instructions.
for a total of 900 simulated instructions. The total number of simulated micro
opcodes (uOps) is also 900.

The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. In this example, the two important
performance indicators are **IPC** and **Block RThroughput** (Block Reciprocal
overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).

IPC is computed dividing the total number of simulated instructions by the total
number of cycles. A delta between Dispatch Width and IPC is an indicator of a
performance issue. In the absence of loop-carried data dependencies, the
number of cycles. In the absence of loop-carried data dependencies, the
observed IPC tends to a theoretical maximum which can be computed by dividing
the number of instructions of a single iteration by the *Block RThroughput*.

IPC is bounded from above by the dispatch width. That is because the dispatch
width limits the maximum size of a dispatch group. IPC is also limited by the
amount of hardware parallelism. The availability of hardware resources affects
the resource pressure distribution, and it limits the number of instructions
that can be executed in parallel every cycle. A delta between Dispatch
Width and the theoretical maximum IPC is an indicator of a performance
bottleneck caused by the lack of hardware resources. In general, the lower the
Block RThroughput, the better.

In this example, ``Instructions per iteration/Block RThroughput`` is 1.50. Since
there are no loop-carried dependencies, the observed IPC is expected to approach
1.50 when the number of iterations tends to infinity. The delta between the
Dispatch Width (2.00), and the theoretical maximum IPC (1.50) is an indicator of
a performance bottleneck caused by the lack of hardware resources, and the
*Resource pressure view* can help to identify the problematic resource usage.
Field 'uOps Per Cycle' is computed dividing the total number of simulated micro
opcodes by the total number of cycles. A delta between Dispatch Width and this
field is an indicator of a performance issue. In the absence of loop-carried
data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical
maximum throughput which can be computed by dividing the number of uOps of a
single iteration by the *Block RThroughput*.

Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle. A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
*Block RThroughput*) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.

In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
resources, and the *Resource pressure view* can help to identify the problematic
resource usage.

The second section of the report shows the latency and reciprocal
throughput of every instruction in the sequence. That section also reports
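
The arithmetic described in the updated documentation can be reproduced for the dot-product report (300 iterations, 900 uOps, 610 cycles, Dispatch Width 2, Block RThroughput 2.0). This is a minimal sketch under those numbers; the function name `theoretical_max_per_cycle` is an assumption made for illustration and does not exist in llvm-mca.

```python
def theoretical_max_per_cycle(count_per_iteration: float,
                              block_rthroughput: float) -> float:
    """Upper bound on a per-cycle rate when there are no loop-carried
    dependencies: the per-iteration count divided by Block RThroughput.

    Hypothetical helper used only to illustrate the documented formula.
    """
    if block_rthroughput <= 0:
        raise ValueError("block_rthroughput must be positive")
    return count_per_iteration / block_rthroughput


# Values from the dot-product summary shown in the documentation diff.
iterations = 300
total_uops = 900
total_cycles = 610
dispatch_width = 2
block_rthroughput = 2.0

uops_per_iteration = total_uops / iterations             # 3 uOps per iteration
max_uops_per_cycle = theoretical_max_per_cycle(
    uops_per_iteration, block_rthroughput)               # theoretical maximum
observed_uops_per_cycle = total_uops / total_cycles      # value in the report
resource_gap = dispatch_width - max_uops_per_cycle       # delta the docs mention

print(f"theoretical max uOps per cycle: {max_uops_per_cycle:.2f}")
print(f"observed uOps per cycle:        {observed_uops_per_cycle:.2f}")
print(f"gap to dispatch width:          {resource_gap:.2f}")
```

The observed value stays slightly below the theoretical maximum because the simulation ran a finite number of iterations; the remaining gap to the dispatch width is the part the documentation attributes to limited hardware resources.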
5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/AArch64/CortexA57/direct-branch.s
@@ -6,7 +6,10 @@
# CHECK: Iterations: 600
# CHECK-NEXT: Instructions: 600
# CHECK-NEXT: Total Cycles: 603
# CHECK-NEXT: Dispatch Width: 3
# CHECK-NEXT: Total uOps: 600

# CHECK: Dispatch Width: 3
# CHECK-NEXT: uOps Per Cycle: 1.00
# CHECK-NEXT: IPC: 1.00
# CHECK-NEXT: Block RThroughput: 1.0

11 changes: 8 additions & 3 deletions llvm/test/tools/llvm-mca/AArch64/Exynos/direct-branch.s
@@ -8,12 +8,17 @@
# ALL-NEXT: Instructions: 300

# M1-NEXT: Total Cycles: 76
# M1-NEXT: Dispatch Width: 4
# M3-NEXT: Total Cycles: 51

# ALL-NEXT: Total uOps: 300

# M1: Dispatch Width: 4
# M1-NEXT: uOps Per Cycle: 3.95
# M1-NEXT: IPC: 3.95
# M1-NEXT: Block RThroughput: 0.3

# M3-NEXT: Total Cycles: 51
# M3-NEXT: Dispatch Width: 6
# M3: Dispatch Width: 6
# M3-NEXT: uOps Per Cycle: 5.88
# M3-NEXT: IPC: 5.88
# M3-NEXT: Block RThroughput: 0.2

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/AArch64/Exynos/pr38575.s
@@ -14,7 +14,10 @@ ror x1, x2, x3
# CHECK: Iterations: 100
# CHECK-NEXT: Instructions: 100
# CHECK-NEXT: Total Cycles: 28
# CHECK-NEXT: Dispatch Width: 6
# CHECK-NEXT: Total uOps: 100

# CHECK: Dispatch Width: 6
# CHECK-NEXT: uOps Per Cycle: 3.57
# CHECK-NEXT: IPC: 3.57
# CHECK-NEXT: Block RThroughput: 0.3

13 changes: 8 additions & 5 deletions llvm/test/tools/llvm-mca/AArch64/Exynos/scheduler-queue-usage.s
@@ -7,13 +7,16 @@
# ALL: Iterations: 1
# ALL-NEXT: Instructions: 1
# ALL-NEXT: Total Cycles: 2
# ALL-NEXT: Total uOps: 1

# M1-NEXT: Dispatch Width: 4
# M3-NEXT: Dispatch Width: 6

# ALL-NEXT: IPC: 0.50

# M1: Dispatch Width: 4
# M1-NEXT: uOps Per Cycle: 0.50
# M1-NEXT: IPC: 0.50
# M1-NEXT: Block RThroughput: 0.3

# M3: Dispatch Width: 6
# M3-NEXT: uOps Per Cycle: 0.50
# M3-NEXT: IPC: 0.50
# M3-NEXT: Block RThroughput: 0.2

# ALL: Schedulers - number of cycles where we saw N instructions issued:
5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/AArch64/Falkor/zero-latency-store.s
@@ -6,7 +6,10 @@
# CHECK: Iterations: 2
# CHECK-NEXT: Instructions: 2
# CHECK-NEXT: Total Cycles: 4
# CHECK-NEXT: Dispatch Width: 8
# CHECK-NEXT: Total uOps: 4

# CHECK: Dispatch Width: 8
# CHECK-NEXT: uOps Per Cycle: 1.00
# CHECK-NEXT: IPC: 0.50
# CHECK-NEXT: Block RThroughput: 1.0

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/ARM/simple-test-cortex-a9.s
@@ -6,7 +6,10 @@ vadd.f32 s0, s2, s2
# CHECK: Iterations: 100
# CHECK-NEXT: Instructions: 100
# CHECK-NEXT: Total Cycles: 105
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 100

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 0.95
# CHECK-NEXT: IPC: 0.95
# CHECK-NEXT: Block RThroughput: 1.0

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/add-sequence.s
@@ -8,7 +8,10 @@ add %eax, %edx
# CHECK: Iterations: 1000
# CHECK-NEXT: Instructions: 3000
# CHECK-NEXT: Total Cycles: 1506
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 3000

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 1.99
# CHECK-NEXT: IPC: 1.99
# CHECK-NEXT: Block RThroughput: 1.5

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/clear-super-register-1.s
@@ -16,7 +16,10 @@ bsf %rax, %rcx
# CHECK: Iterations: 100
# CHECK-NEXT: Instructions: 400
# CHECK-NEXT: Total Cycles: 704
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 1200

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 1.70
# CHECK-NEXT: IPC: 0.57
# CHECK-NEXT: Block RThroughput: 6.0

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/clear-super-register-2.s
@@ -34,7 +34,10 @@ vandps %xmm4, %xmm1, %xmm0
# CHECK: Iterations: 100
# CHECK-NEXT: Instructions: 1800
# CHECK-NEXT: Total Cycles: 3811
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 3400

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 0.89
# CHECK-NEXT: IPC: 0.47
# CHECK-NEXT: Block RThroughput: 38.0

@@ -12,7 +12,10 @@ cmovae %ebx, %eax
# CHECK: Iterations: 1500
# CHECK-NEXT: Instructions: 3000
# CHECK-NEXT: Total Cycles: 1504
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 3000

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 1.99
# CHECK-NEXT: IPC: 1.99
# CHECK-NEXT: Block RThroughput: 1.0

@@ -15,7 +15,10 @@ vpcmpeqq %xmm3, %xmm3, %xmm0
# CHECK: Iterations: 1500
# CHECK-NEXT: Instructions: 6000
# CHECK-NEXT: Total Cycles: 3003
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 6000

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 2.00
# CHECK-NEXT: IPC: 2.00
# CHECK-NEXT: Block RThroughput: 2.0

@@ -16,7 +16,10 @@ vpcmpgtq %xmm3, %xmm3, %xmm0
# CHECK: Iterations: 1500
# CHECK-NEXT: Instructions: 6000
# CHECK-NEXT: Total Cycles: 3001
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 6000

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 2.00
# CHECK-NEXT: IPC: 2.00
# CHECK-NEXT: Block RThroughput: 2.0

@@ -13,7 +13,10 @@ sbb %eax, %eax
# CHECK: Iterations: 1500
# CHECK-NEXT: Instructions: 3000
# CHECK-NEXT: Total Cycles: 3003
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 3000

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 1.00
# CHECK-NEXT: IPC: 1.00
# CHECK-NEXT: Block RThroughput: 2.0

@@ -14,7 +14,10 @@ sbb %eax, %eax
# CHECK: Iterations: 1500
# CHECK-NEXT: Instructions: 4500
# CHECK-NEXT: Total Cycles: 3007
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 6000

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 2.00
# CHECK-NEXT: IPC: 1.50
# CHECK-NEXT: Block RThroughput: 2.0

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/dependent-pmuld-paddd.s
@@ -8,7 +8,10 @@ vpaddd %xmm0, %xmm0, %xmm3
# CHECK: Iterations: 500
# CHECK-NEXT: Instructions: 1500
# CHECK-NEXT: Total Cycles: 1504
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 1500

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 1.00
# CHECK-NEXT: IPC: 1.00
# CHECK-NEXT: Block RThroughput: 1.5

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/dot-product.s
@@ -8,7 +8,10 @@ vhaddps %xmm3, %xmm3, %xmm4
# CHECK: Iterations: 300
# CHECK-NEXT: Instructions: 900
# CHECK-NEXT: Total Cycles: 610
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 900

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 1.48
# CHECK-NEXT: IPC: 1.48
# CHECK-NEXT: Block RThroughput: 2.0

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/hadd-read-after-ld-1.s
@@ -7,7 +7,10 @@ vhaddps (%rdi), %xmm1, %xmm2
# CHECK: Iterations: 1
# CHECK-NEXT: Instructions: 2
# CHECK-NEXT: Total Cycles: 11
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 2

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 0.18
# CHECK-NEXT: IPC: 0.18
# CHECK-NEXT: Block RThroughput: 1.0

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/hadd-read-after-ld-2.s
@@ -7,7 +7,10 @@ vhaddps (%rdi), %ymm1, %ymm2
# CHECK: Iterations: 1
# CHECK-NEXT: Instructions: 2
# CHECK-NEXT: Total Cycles: 12
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 3

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 0.25
# CHECK-NEXT: IPC: 0.17
# CHECK-NEXT: Block RThroughput: 2.0

6 changes: 5 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/instruction-info-view.s
@@ -14,7 +14,11 @@ vhaddps %xmm3, %xmm3, %xmm4
# ENABLED: Iterations: 100
# ENABLED-NEXT: Instructions: 300
# ENABLED-NEXT: Total Cycles: 209
# ENABLED-NEXT: Dispatch Width: 2
# ENABLED-NEXT: Total uOps: 300


# ENABLED: Dispatch Width: 2
# ENABLED-NEXT: uOps Per Cycle: 1.44
# ENABLED-NEXT: IPC: 1.44
# ENABLED-NEXT: Block RThroughput: 2.0

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/load-store-alias.s
@@ -13,7 +13,10 @@ vmovaps %xmm0, 48(%rdi)
# CHECK: Iterations: 100
# CHECK-NEXT: Instructions: 800
# CHECK-NEXT: Total Cycles: 2403
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 800

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 0.33
# CHECK-NEXT: IPC: 0.33
# CHECK-NEXT: Block RThroughput: 4.0

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/memcpy-like-test.s
@@ -13,7 +13,10 @@ vmovaps %xmm0, 48(%rdi)
# CHECK: Iterations: 100
# CHECK-NEXT: Instructions: 800
# CHECK-NEXT: Total Cycles: 408
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 800

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 1.96
# CHECK-NEXT: IPC: 1.96
# CHECK-NEXT: Block RThroughput: 4.0

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/one-idioms.s
@@ -30,7 +30,10 @@ vpcmpeqw %xmm3, %xmm3, %xmm5
# CHECK: Iterations: 100
# CHECK-NEXT: Instructions: 1500
# CHECK-NEXT: Total Cycles: 753
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 1500

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 1.99
# CHECK-NEXT: IPC: 1.99
# CHECK-NEXT: Block RThroughput: 7.5

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/partial-reg-update-2.s
@@ -8,7 +8,10 @@ add %ecx, %ebx
# CHECK: Iterations: 1
# CHECK-NEXT: Instructions: 3
# CHECK-NEXT: Total Cycles: 11
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 4

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 0.36
# CHECK-NEXT: IPC: 0.27
# CHECK-NEXT: Block RThroughput: 4.0

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/partial-reg-update-3.s
@@ -13,7 +13,10 @@ xor %bx, %dx
# CHECK: Iterations: 1500
# CHECK-NEXT: Instructions: 4500
# CHECK-NEXT: Total Cycles: 4503
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 4500

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 1.00
# CHECK-NEXT: IPC: 1.00
# CHECK-NEXT: Block RThroughput: 1.5

5 changes: 4 additions & 1 deletion llvm/test/tools/llvm-mca/X86/BtVer2/partial-reg-update-4.s
@@ -13,7 +13,10 @@ add %cx, %bx
# CHECK: Iterations: 1500
# CHECK-NEXT: Instructions: 4500
# CHECK-NEXT: Total Cycles: 7503
# CHECK-NEXT: Dispatch Width: 2
# CHECK-NEXT: Total uOps: 6000

# CHECK: Dispatch Width: 2
# CHECK-NEXT: uOps Per Cycle: 0.80
# CHECK-NEXT: IPC: 0.60
# CHECK-NEXT: Block RThroughput: 2.0
