# Using Performance Counters

In [1]:
!mkdir -p tmp

## Using `perf`

`perf` is a Linux tool for accessing hardware and software performance counters.

See also the [Wiki documentation](https://perf.wiki.kernel.org/index.php/Main_Page) for `perf`.

In [2]:
!perf list

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  cgroup-switches                                    [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy             

Examine the output of the following in a terminal:

* `perf top`
* `perf top -z`
* `perf top -e cache-misses`
* `perf top -e cache-misses,cycles`


In [3]:
%%writefile tmp/transpose.c

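#include <stdio.h>
#include <stdlib.h>

int main()
{
    const int m = 1024;  /* rows of matrix */
    const int n = 1024;  /* columns of matrix */
    int *matrix = malloc(sizeof(int) * m * n);
    int *transpose = malloc(sizeof(int) * m * n);
    if (!matrix || !transpose)
        return 1;

    /* Fill the m x n matrix in row-major order. */
    for (int c = 0; c < m; c++)
       for (int d = 0; d < n; d++)
          matrix[c*n + d] = c+d;

    /* Repeat the transpose 300 times so that it dominates the run time.
       Reads are contiguous; writes stride through memory by m ints. */
    for (int i = 0; i < 300; ++i)
        for (int c = 0; c < m; c++)
           for (int d = 0; d < n; d++)
              transpose[d*m + c] = matrix[c*n + d];

    printf("Transpose of the matrix:\n");

    /* Sum the n x m transpose so the compiler cannot discard the work. */
    int sum = 0;
    for (int c = 0; c < m; c++)
       for (int d = 0; d < n; d++)
          sum += transpose[d*m + c];
    printf("sum: %d\n", sum);

    free(matrix);
    free(transpose);
    return 0;
}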

Overwriting tmp/transpose.c
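
The inner loop of `transpose.c` reads `matrix` contiguously but writes `transpose` with a stride of 1024 ints, which is why `cache-misses` is an interesting event for this program. The sketch below counts how many distinct cache lines one pass of the inner loop touches on each side. It assumes 4-byte `int`s, 64-byte cache lines, and a line-aligned array base, none of which the notebook states; `cache_lines_touched` is a helper made up for illustration.

```python
# Assumed sizes (not stated in the notebook): 4-byte int, 64-byte cache line.
INT_BYTES = 4
CACHE_LINE_BYTES = 64
N = 1024  # matrix dimension used in transpose.c


def cache_lines_touched(stride_elems, count):
    """Distinct cache lines hit by `count` accesses spaced `stride_elems` ints apart.

    Assumes the first access falls on the start of a cache line.
    """
    lines = {(k * stride_elems * INT_BYTES) // CACHE_LINE_BYTES for k in range(count)}
    return len(lines)


reads = cache_lines_touched(1, N)   # matrix side: consecutive ints
writes = cache_lines_touched(N, N)  # transpose side: one full row apart
print(f"cache lines read: {reads}, cache lines written: {writes}")
```

Under these assumptions every write in the inner loop lands on a different cache line, while sixteen consecutive reads share one.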


In [4]:
!(cd tmp; gcc transpose.c -O3 -o transpose)
!bash -c "time ./tmp/transpose"

Transpose of the matrix:
sum: 1072693248

real	0m1,016s
user	0m1,012s
sys	0m0,004s


In [5]:
!perf record -e cycles,instructions ./tmp/transpose

Transpose of the matrix:
sum: 1072693248
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0,427 MB perf.data (8461 samples) ]


* Examine `perf report` in the terminal.
* Now rebuild with `-g` instead of `-O3`, rerun `perf record`, and compare the report.

In [6]:
%%writefile tmp/matvec.py

import numpy as np

n = 4096
A = np.random.randn(n, n)
b = np.random.randn(n)

for i in range(10):
    A @ b

Writing tmp/matvec.py


In [9]:
!OPENBLAS_NUM_THREADS=1 perf record python tmp/matvec.py

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0,112 MB perf.data (1501 samples) ]


In [23]:
%%writefile tmp/matmat.py

import numpy as np

n = 2048
A = np.random.randn(n, n)
B = np.random.randn(n, n)

for i in range(20):
    A @ B

Overwriting tmp/matmat.py


In [24]:
!OPENBLAS_NUM_THREADS=1 perf record python tmp/matmat.py

[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 1,377 MB perf.data (29133 samples) ]


Run the following separately in a shell:
```
perf record \
  -e cycles,L1-dcache-load-misses \
  -e fp_arith_inst_retired.256b_packed_double \
  -c 10 \
  python tmp/matvec.py
```

* Also try `-c 100` (one sample per 100 events instead of one per 10).
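
With `-c N`, `perf record` takes one sample every N occurrences of the event, so the period trades overhead against resolution. The following sketch shows the arithmetic in both directions. The total event count is a made-up number and both helper functions are hypothetical, not part of `perf`.

```python
def samples_for(total_events, period):
    """Expected sample count when one sample is taken every `period` events."""
    return total_events // period


def estimated_events(num_samples, period):
    """Scale a sample count back up to an estimate of the event count."""
    return num_samples * period


# Hypothetical total: 1e9 occurrences of some event during the run.
total = 1_000_000_000
for period in (10, 100):
    n = samples_for(total, period)
    print(f"-c {period}: about {n} samples, each standing for {period} events")
```

A smaller period gives finer attribution but more samples to record, which perturbs the program being measured.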

Look at:

* `perf help`
* `perf help record`

Aspects to mention:

* Measuring parts of a program?
* Granularity for ratios?
* Scope of collection
* Call graph collection (`-g`)
* Precise events
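
On the question of ratios: the coarsest granularity is one ratio for the whole program, computed from total counts. Here is a minimal sketch, assuming the counts were collected with `perf stat -x, -e cycles,instructions` (a command not used elsewhere in this notebook), which prints one CSV row per event with the value first and the event name third. The sample text is made up, and on real systems event names may carry prefixes or suffixes such as `cpu_core/.../` or `:u`.

```python
import csv
import io

# Made-up sample in the `perf stat -x,` field order:
# value, unit, event name, counter run time, percent of time counted
SAMPLE = """\
3000000000,,cycles,1000000000,100.00
6000000000,,instructions,1000000000,100.00
"""


def parse_counts(text):
    """Map event name to count, skipping rows without a numeric value."""
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or not row[0].isdigit():
            continue  # blank lines, "<not counted>", "<not supported>"
        counts[row[2]] = int(row[0])
    return counts


counts = parse_counts(SAMPLE)
ipc = counts["instructions"] / counts["cycles"]
print(f"IPC = {ipc:.2f}")
```

A whole-program ratio like this averages over phases with very different behavior, which is one reason to ask how to measure only parts of a program.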

## Using pmu-tools / toplev

This uses `toplev.py` from Andi Kleen's [pmu-tools](https://github.com/andikleen/pmu-tools).

* Try the command below for a few different levels (for example `-l1` and `-l2`).
* Try the command below for both `matvec.py` and `matmat.py`.
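
When comparing the two programs, it can help to have a rough estimate of how many floating-point operations each performs per byte of data. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes 8-byte `float64` entries (what `np.random.randn` produces) and that each operand and the result move through memory exactly once, which real runs will not match exactly. Both function names are made up for illustration.

```python
BYTES_PER_DOUBLE = 8  # np.random.randn produces float64


def matvec_intensity(n):
    """Flops per byte for an n x n matrix times a vector, each array moved once."""
    flops = 2 * n * n                           # one multiply and one add per matrix entry
    data = BYTES_PER_DOUBLE * (n * n + 2 * n)   # A, b, and the result vector
    return flops / data


def matmat_intensity(n):
    """Flops per byte for an n x n matrix-matrix product, each array moved once."""
    flops = 2 * n ** 3
    data = BYTES_PER_DOUBLE * 3 * n * n         # A, B, and the result matrix
    return flops / data


print(f"matvec (n=4096): {matvec_intensity(4096):.2f} flop/byte")
print(f"matmat (n=2048): {matmat_intensity(2048):.2f} flop/byte")
```

The gap of several hundred times between the two estimates suggests looking at the memory-related nodes for `matvec.py` and the core-related nodes for `matmat.py`.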

In [25]:
%%bash

OPENBLAS_NUM_THREADS=1 python ~/pack/pmu-tools/toplev.py -l4 python tmp/matmat.py

Consider disabling nmi watchdog to minimize multiplexing
(echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog or
 echo kernel.nmi_watchdog=0 >> /etc/sysctl.conf ; sysctl -p as root)
BR_MISP_RETIRED.COND_NTAKEN_COST event not found for cpu_core
BR_MISP_RETIRED.COND_TAKEN_COST event not found for cpu_core
BR_MISP_RETIRED.INDIRECT_CALL_COST event not found for cpu_core
BR_MISP_RETIRED.INDIRECT_COST event not found for cpu_core
BR_MISP_RETIRED.RET_COST event not found for cpu_core
MEM_INST_RETIRED.STLB_HIT_LOADS event not found for cpu_core
MEM_INST_RETIRED.STLB_HIT_STORES event not found for cpu_core
TOPDOWN_FE_BOUND.ALL_P event not found for cpu_atom
TOPDOWN_FE_BOUND.ITLB_MISS event not found for cpu_atom
TOPDOWN_BAD_SPECULATION.ALL_P event not found for cpu_atom
TOPDOWN_BE_BOUND.ALL_P event not found for cpu_atom
TOPDOWN_RETIRING.ALL_P event not found for cpu_atom
14 events not counted
# 5.01-full-perf, 4 on 13th Gen Intel(R) Core(TM) i7-1365U [mtl]
core BE               Backend_Bound      

Mismeasured (out of bound values):FP_Arith FP_Vector
13 nodes had zero counts: Branch_Detect Branch_Resteer Cisc Decode Fast_Nuke Mem_Scheduler Non_Mem_Scheduler Nuke Other_FB Predecode Register Reorder_Buffer Serialization
Add --run-sample to find locations
Add --nodes '!+Ports_Utilized_3m*/5,+MUX' for breakdown.


## Using LIKWID

This section uses [pylikwid](https://github.com/RRZE-HPC/pylikwid), a Python wrapper around [likwid](https://github.com/RRZE-HPC/likwid). LIKWID also offers an analogous [C API](https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr).

In [14]:
!likwid-perfctr -e

This architecture has 39 counters.
Counter tags(name, type<, options>):
BBOX0C1, Home Agent box 0, EDGEDETECT|THRESHOLD|INVERT
BBOX0C2, Home Agent box 0, EDGEDETECT|THRESHOLD|INVERT
BBOX0C3, Home Agent box 0, EDGEDETECT|THRESHOLD|INVERT
BBOX1C1, Home Agent box 1, EDGEDETECT|THRESHOLD|INVERT
BBOX1C2, Home Agent box 1, EDGEDETECT|THRESHOLD|INVERT
BBOX1C3, Home Agent box 1, EDGEDETECT|THRESHOLD|INVERT
MBOX2C1, Memory Controller 0 Channel 2, EDGEDETECT|THRESHOLD|INVERT
MBOX2C2, Memory Controller 0 Channel 2, EDGEDETECT|THRESHOLD|INVERT
MBOX2C3, Memory Controller 0 Channel 2, EDGEDETECT|THRESHOLD|INVERT
MBOX2FIX, Memory Controller 0 Channel 2 Fixed Counter, INVERT
MBOX3C1, Memory Controller 0 Channel 3, EDGEDETECT|THRESHOLD|INVERT
MBOX3C2, Memory Controller 0 Channel 3, EDGEDETECT|THRESHOLD|INVERT
MBOX3C3, Memory Controller 0 Channel 3, EDGEDETECT|THRESHOLD|INVERT
MBOX3FIX, Memory Controller 0 Channel 3 Fixed Counter, INVERT
MBOX6C1, Memory Controller 1 Channel 2, EDGEDETECT|THRESHOLD|INVER

In [3]:
!likwid-perfctr -a

 Group name	Description
--------------------------------------------------------------------------------
  FLOPS_AVX	Packed AVX MFLOP/s
  TLB_INSTR	L1 Instruction TLB miss rate/ratio
       NUMA	Local and remote memory accesses
     ENERGY	Power and Energy consumption
   TLB_DATA	L2 data TLB miss rate/ratio
      CLOCK	Power and Energy consumption
 PORT_USAGE	Execution port utilization
CYCLE_ACTIVITY	Cycle Activities
       UOPS	UOPs execution info
        QPI	QPI Link Layer data
         L2	L2 cache bandwidth in MBytes/s
     CACHES	Cache bandwidth in MBytes/s
     BRANCH	Branch prediction miss rate/ratio
       DATA	Load to store ratio
   RECOVERY	Recovery duration
  UOPS_EXEC	UOPs execution
        MEM	Main memory bandwidth in MBytes/s
 UOPS_ISSUE	UOPs issueing
     ICACHE	Instruction cache miss rate/ratio
    L3CACHE	L3 cache miss rate/ratio
    L2CACHE	L2 cache miss rate/ratio
       SBOX	Ring Transfer bandwidth
         HA	Main memory bandwidth in MBytes/s seen from Home agent
FA

In [28]:
!likwid-perfctr -H -g MEM

Group MEM:
Formulas:
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0))*64.0/runtime
Memory read data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC1))*64.0/runtime
Memory write data volume [GBytes] = 1.0E-09*(SUM(MBOXxC1))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(SUM(MBOXxC0)+SUM(MBOXxC1))*64.0
-
Profiling group to measure memory bandwidth drawn by all cores of a socket.
Since this group is based on Uncore events it is only possible to measure on a
per socket base. Some of the counters may not be available on your system.
Also outputs total data volume transferred from main memory.
The same metrics are provided by the HA group.
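
The formulas above can be evaluated by hand: each `MBOXxC0` count is one 64-byte cache line read and each `MBOXxC1` count is one written. The sketch below transcribes the six formulas into Python. The counter values and runtime are made up, and `mem_group_metrics` is a hypothetical helper, not part of LIKWID.

```python
def mem_group_metrics(read_counts, write_counts, runtime_s):
    """Evaluate the MEM group formulas.

    read_counts:  per-channel MBOXxC0 values (cache lines read)
    write_counts: per-channel MBOXxC1 values (cache lines written)
    runtime_s:    measured runtime in seconds
    """
    reads = sum(read_counts)
    writes = sum(write_counts)
    return {
        "read_bw_MBps": 1.0e-06 * reads * 64.0 / runtime_s,
        "read_vol_GB": 1.0e-09 * reads * 64.0,
        "write_bw_MBps": 1.0e-06 * writes * 64.0 / runtime_s,
        "write_vol_GB": 1.0e-09 * writes * 64.0,
        "total_bw_MBps": 1.0e-06 * (reads + writes) * 64.0 / runtime_s,
        "total_vol_GB": 1.0e-09 * (reads + writes) * 64.0,
    }


# Made-up counts for two memory channels over a 2-second run.
metrics = mem_group_metrics([1_000_000, 1_000_000], [500_000, 500_000], 2.0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```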



In [15]:
%%writefile tmp/perfctr.py

import numpy as np
import likwid

likwid.init_thread()
likwid.init_openmp_threads()

n = 2048

with likwid.Region("generation"):
    A = np.random.randn(n, n)
    b = np.random.randn(n)

with likwid.Region("matmul"):
    A @ A

Overwriting tmp/perfctr.py


Also try adding the `-m` option (Marker API) to the command below, so that counts are reported per marked region.

* Advantages?
* Disadvantages?

Make sure the MSR access daemon is SUID root:

```
chmod u+s /usr/sbin/likwid-accessD
```

In [18]:
!likwid-perfctr -C S0:0-7@S1:0-7 -M 1 -g MEM python3 ./tmp/perfctr.py

--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
CPU type:	Intel Xeon Broadwell EN/EP/EX processor
CPU clock:	2.19 GHz
--------------------------------------------------------------------------------
Running without Marker API. Activate Marker API with -m on commandline.
--------------------------------------------------------------------------------
Group 1: MEM
+-----------------------+---------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------+-----------+
|         Event         | Counter |   Core 0   |   Core 1  |   Core 2  |   Core 3  |   Core 4  |   Core 5  |   Core 6  |   Core 7  |  Core 12  |  Core 13  |  Core 14  |  Core 15  |  Core 16  |  Core 17  |  Core 18 |  Core 19  |
+-----------------------+---------+------------+-----------+-----------+-----------+----------