# Using Performance Counters

In [1]:
!mkdir -p tmp

## Using `perf`

`perf` is a Linux tool for accessing hardware performance counters.

See also the [Wiki documentation](https://perf.wiki.kernel.org/index.php/Main_Page) for `perf`.

In [2]:
!perf list

Examine the output of the following in a terminal:

* `perf top`
* `perf top -z`
* `perf top -e cache-misses`
* `perf top -e cache-misses,cycles`


In [3]:
%%writefile tmp/transpose.c

#include <stdio.h>
#include <stdlib.h>
 
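int main(void)
{
    const int m = 1024;  /* rows of matrix */
    const int n = 1024;  /* columns of matrix */
    int *matrix = malloc(sizeof(int) * m * n);
    int *transpose = malloc(sizeof(int) * m * n);
    if (!matrix || !transpose)
        return 1;

    /* matrix is m x n, row-major: entry (c, d) lives at c*n + d */
    for (int c = 0; c < m; c++)
        for (int d = 0; d < n; d++)
            matrix[c*n + d] = c + d;

    /* repeat the transpose so that there is enough work to sample */
    for (int i = 0; i < 300; ++i)
        for (int c = 0; c < m; c++)
            for (int d = 0; d < n; d++)
                /* transpose is n x m: entry (d, c) lives at d*m + c */
                transpose[d*m + c] = matrix[c*n + d];

    /* the checksum keeps the compiler from discarding the transpose */
    int sum = 0;
    for (int c = 0; c < m; c++)
        for (int d = 0; d < n; d++)
            sum += transpose[d*m + c];
    printf("sum: %d\n", sum);

    free(matrix);
    free(transpose);
    return 0;
}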
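
The index arithmetic in `transpose.c` is easy to get wrong once the matrix is not square. For a row-major `m` x `n` matrix, entry `(c, d)` sits at offset `c*n + d`, and its transposed counterpart at `d*m + c`. The sketch below checks that on a small non-square case. The function name `transpose_flat` and the test values are made up for illustration and are not part of the original program.

```python
def transpose_flat(matrix, m, n):
    """Transpose a row-major m x n matrix stored as a flat list."""
    out = [0] * (m * n)
    for c in range(m):        # row of the input
        for d in range(n):    # column of the input
            # input entry (c, d) is at c*n + d; output entry (d, c) is at d*m + c
            out[d * m + c] = matrix[c * n + d]
    return out


m, n = 3, 5  # deliberately non-square (illustrative sizes)
rows = [[10 * c + d for d in range(n)] for c in range(m)]
flat = [x for row in rows for x in row]

# Reference result: zip(*rows) yields the columns of the nested-list matrix.
expected = [x for col in zip(*rows) for x in col]
print(transpose_flat(flat, m, n) == expected)

# In transpose.c, consecutive inner-loop writes are a full row apart.
write_stride_bytes = 1024 * 4  # 1024 ints per row, assuming 4-byte int
print(write_stride_bytes)
```

With 1024 four-byte entries per row, successive writes in the inner loop land 4096 bytes apart, so each one likely touches a different cache line (assuming the common 64-byte line size). That is the kind of behavior `perf top -e cache-misses` can make visible.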

In [4]:
!(cd tmp; gcc transpose.c -O3 -o transpose)
!bash -c "time ./tmp/transpose"

In [5]:
!perf record -e cycles,instructions ./tmp/transpose

* Examine `perf report` in the terminal.
* Now retry, this time building with `-g` instead of `-O3`

In [6]:
%%writefile tmp/matvec.py

import numpy as np

n = 4096
A = np.random.randn(n, n)
b = np.random.randn(n)

for i in range(10):
    A @ b

In [7]:
!perf record python tmp/matvec.py

In [8]:
%%writefile tmp/matmat.py

import numpy as np

n = 2048
A = np.random.randn(n, n)
B = np.random.randn(n, n)

for i in range(10):
    A @ B

In [9]:
!perf record python tmp/matmat.py
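
One reason to profile both scripts is that they differ sharply in how much arithmetic they do per byte of data. The sketch below estimates that ratio under idealized assumptions: double-precision entries (8 bytes), every operand moved to or from memory exactly once, and no cache effects. The function names are hypothetical, and real traffic depends on the BLAS library NumPy is linked against.

```python
def matvec_intensity(n, bytes_per_entry=8):
    """Idealized flops per byte for y = A @ b with an n x n matrix A."""
    flops = 2 * n * n                             # one multiply and one add per entry of A
    traffic = bytes_per_entry * (n * n + 2 * n)   # read A and b, write y
    return flops / traffic


def matmat_intensity(n, bytes_per_entry=8):
    """Idealized flops per byte for C = A @ B with n x n matrices."""
    flops = 2 * n ** 3                            # n multiply-add pairs per entry of C
    traffic = bytes_per_entry * 3 * n * n         # read A and B, write C, each once
    return flops / traffic


print(f"matvec, n=4096: {matvec_intensity(4096):.3f} flops/byte")
print(f"matmat, n=2048: {matmat_intensity(2048):.1f} flops/byte")
```

Under these assumptions the matrix-vector product stays just below 0.25 flops per byte regardless of `n`, while the matrix-matrix product grows like `n/12`. It is plausible, though not guaranteed, that the first shows up as memory-bound and the second as compute-bound in the counter data.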

Run in shell separately:
```
perf record \
  -e cycles,L1-dcache-load-misses \
  -e fp_arith_inst_retired.256b_packed_double \
  -c 10 \
  python tmp/matvec.py
```

* Also try `-c 100`
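
The `-c` option sets the sample period: `perf record` takes one sample each time the event has occurred that many times. The sketch below shows how the period scales the number of samples. The event count is a made-up placeholder, not a measurement, and the helper name is hypothetical.

```python
def expected_samples(event_count, period):
    """Approximate number of samples when sampling once every `period` events."""
    return event_count // period


event_count = 1_000_000  # hypothetical number of event occurrences
for period in (10, 100):
    print(f"-c {period}: about {expected_samples(event_count, period)} samples")
```

A smaller period gives finer attribution at the cost of more sampling overhead and a larger `perf.data` file.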

Look at:

* `perf help`
* `perf help record`

Aspects to mention:

* Measuring parts of a program?
* Granularity for ratios?
* Scope of collection
* Call graph collection (`-g`)
* Precise events

## Using pmu-tools / toplev

This uses `toplev.py` from Andi Kleen's [pmu-tools](https://github.com/andikleen/pmu-tools).

* Try the command below for a few different levels.
* Try the command below for both `tmp/matvec.py` and `tmp/matmat.py`.

In [28]:
%%bash

python2.7 ~/pack/pmu-tools/toplev.py -l3 python tmp/matvec.py

## Using LIKWID

This section uses [pylikwid](https://github.com/RRZE-HPC/pylikwid), a Python wrapper around [likwid](https://github.com/RRZE-HPC/likwid). LIKWID itself offers an analogous [C API](https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr).

In [14]:
!likwid-perfctr -e

In [3]:
!likwid-perfctr -a

In [28]:
!likwid-perfctr -H -g MEM

In [15]:
%%writefile tmp/perfctr.py

import numpy as np
import likwid

likwid.init_thread()
likwid.init_openmp_threads()

n = 2048

with likwid.Region("generation"):
    A = np.random.randn(n, n)
    b = np.random.randn(n)

with likwid.Region("matmul"):
    A @ A
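
The regions in `tmp/perfctr.py` bracket parts of the program so that measurements can be attributed to each part separately. The sketch below illustrates that structure with a plain wall-clock timer. It is a stand-in and not LIKWID's API: the names `region` and `timings` are hypothetical, and it records elapsed time rather than hardware counters.

```python
import time
from contextlib import contextmanager

timings = {}  # region name -> accumulated wall-clock seconds


@contextmanager
def region(name):
    """Accumulate the wall-clock time spent inside the `with` block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


with region("generation"):
    data = [i * i for i in range(100_000)]

with region("sum"):
    total = sum(data)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Entering the same region name more than once accumulates into a single entry, which is convenient when the marked code sits inside a loop.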

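Also add the `-m` option (marker API mode) to the `likwid-perfctr` command below, so that the regions marked in `tmp/perfctr.py` are reported separately.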

* Advantages?
* Disadvantages?

Make sure the MSR access daemon is SUID root:

```
chmod u+s /usr/sbin/likwid-accessD
```

In [18]:
!likwid-perfctr -C S0:0-7@S1:0-7 -M 1 -g MEM python3 ./tmp/perfctr.py