# Performance Analysis Profiling on CPU and GPU cores

Welcome to the webinar _Performance Analysis Profiling on CPU and GPU cores_. In this webinar you will learn several techniques for profiling CPU and GPU applications, with an emphasis on supercomputing environments.

## The Coding Environment

The first step is to display information about the CPU architecture with the `lscpu` command:

In [None]:
!lscpu

In this node, we can observe that the multi-GPU resources connect with the NUMA nodes.

For your work today, you have access to several GPUs in the cloud. Run the following cell to see the GPUs available to you today.

In [None]:
!nvidia-smi topo -m 

While your work today will be on a single node, all the techniques you learn can also be used to run your applications across clusters of multi-GPU nodes.

Let us show the NVLink status of GPU 0 as reported by `nvidia-smi`:

In [None]:
!nvidia-smi nvlink --status -i 0

Finally, the `lstopo` command shows the topology of the system, including the NUMA memory nodes.

In [None]:
!lstopo --of png > airis.png

The following cell loads and displays the resulting _.png_ image in Jupyter:

In [None]:
from IPython.display import display
from PIL import Image
path="airis.png"
display(Image.open(path))

## Environment Modules on AIRIS 

These modules must be loaded before starting Jupyter Notebook:
```text
Currently Loaded Modulefiles:
    1) anaconda3/2022.05 
    2) cuda/11.6         
    3) intel/vtune/2023.2.0 
    4) python/3.11.5 
    5) intel/compiler-rt/2023.2.1
    6) openmpi/4.1.5.3
    7) intel/icc 
```

## Profiling CPU cores

Profiling is a form of dynamic program analysis that measures where a program spends its time and resources while it runs. In supercomputing environments, we need to identify the sections of code with the highest computational cost so that we can focus our parallelization efforts on them; in this way, we work where the code actually needs to gain performance.
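The idea of locating the most expensive code can be tried directly in a notebook cell before moving to the C example. The sketch below is only an illustration: it uses Python's built-in `cProfile` module, which is not one of the tools covered in this webinar, and the functions `slow_part` and `fast_part` are hypothetical stand-ins for real application code.

```python
import cProfile
import io
import pstats


def slow_part(n):
    # Hypothetical hotspot: a quadratic amount of work.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total


def fast_part(n):
    # Hypothetical cheap function: a linear amount of work.
    return sum(range(n))


def run():
    slow_part(300)
    fast_part(300)


# Collect timing data only while run() executes.
profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

# Sort by cumulative time so the most expensive call paths come first.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

In the printed table, `slow_part` should account for most of the time, which marks it as the place where optimization effort would pay off. The C program that follows is analyzed in the same spirit with dedicated profilers.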

In [None]:
%%writefile test_cpucores.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void func1(void)
{    
    int n = 1024;  
    int i,j,k;

    int  *A = (int *) malloc (sizeof(int)*n*n);
    int  *B = (int *) malloc (sizeof(int)*n*n);
    /* calloc zero-initializes C, which is accumulated with += below */
    int  *C = (int *) calloc (n*n, sizeof(int));

    for(i = 0; i < n; i++){
      for(j = 0; j < n; j++){
        A[i*n+j] = rand()%(10-1)*1;
        B[i*n+j] = rand()%(10-1)*1;
      }
    }

    for(i = 0; i < n; i++) 
     for(j = 0; j < n; j++)
       for( k = 0; k < n; k++) 
        C[i*n+j]+=A[i*n+k]*B[k*n+j]; 

    /* release the matrices so repeated calls do not leak memory */
    free(A);
    free(B);
    free(C);

    return;
}

void func2(void)
{
    double h, x, s = 0, result;
    int a = 0, b = 1;
    int n = 1000000;
    
    /* cast to double: integer division would make h equal to zero */
    h = (double)(b - a) / n;

    for(int i = 0; i < n; i++) 
    {
       x = (a + h * (i + 0.5));
       s += 100 * x + sin(2 * x * 3.14159);
    }

    result = h * s;
    
    return;
}

int main(int argc, char **argv)
{
    printf("Inside main()\n");

    printf("Inside func1()\n");
    func1();
    
    printf("Inside func2()\n");
    for(int i = 0; i < 500; i++)
      func2();

    return 0;
}

### ⊗ GPROF

We will use the `GNU profiler` (gprof), whose primary function is to record how much time is spent in each function during execution and to generate performance reports. To profile the code, add the _-pg_ flag when compiling the sequential code, run the program as usual to generate the binary profile data file (`gmon.out`), and then display it in readable form with the `gprof` command, as follows:

In [None]:
!gcc test_cpucores.c -o test_gprof -lm -pg 

In [None]:
!./test_gprof

In [None]:
!gprof -b test_gprof gmon.out

### ⊗ VTUNE

The `Intel(R) VTune(TM) Profiler` command-line tool (vtune) performs a hotspots collection based on user-mode sampling of the given target, as follows:

In [None]:
!gcc -g test_cpucores.c -o test_vtune -lm

In [None]:
!vtune -collect hotspots -r dir_results ./test_vtune

#### Using gprof2dot to Visualize the Call Graph

- Download

  * [Git repository](https://github.com/jrfonseca/gprof2dot)


In [None]:
!vtune -report gprof-cc -result-dir dir_results/ -format text -report-output output.txt

In [None]:
!/home/murilo/gprof2dot/gprof2dot.py -f axe output.txt | dot -Tpng -o output.png

In [None]:
from IPython.display import display
from PIL import Image
path="output.png"
display(Image.open(path))

### ☆ Question:

- How do we profile an MPI code?

In [None]:
%%writefile mm-mpi.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <mpi.h>

void mms(double *a, int fa, int ca, int lda, double *b, int fb, int cb, int ldb, double *c, int fc, int cc, int ldc) {
    int i, j, k;
    double s;
    for (i = 0; i < fa; i++) 
        for (j = 0; j < cb; j++) {
            s = 0.;
            for (k = 0; k < ca; k++)
                s += a[i * lda + k] * b[k * ldb + j];
            c[i * ldc + j] = s;
        }
}

void mm(double *a, int fa, int ca, int lda, double *b, int fb, int cb, int ldb, double *c, int fc, int cc, int ldc, int nodo, int np) {
    int i, j, k;
    double s;
    if (nodo == 0) {
        for (i = 1; i < np; i++)
            MPI_Send(&a[i * lda * fa / np], fa / np * ca, MPI_DOUBLE, i, 20, MPI_COMM_WORLD);
        MPI_Bcast(b, fb * cb, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(a, fa / np * ca, MPI_DOUBLE, 0, 20, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Bcast(b, fb * cb, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }
    mms(a, fa / np, ca, lda, b, fb, cb, ldb, c, fc / np, cc, ldc);
    if (nodo == 0)
        for (i = 1; i < np; i++)
            MPI_Recv(&c[i * ldc * fc / np],fc / np * cc, MPI_DOUBLE, i, 30, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    else
        MPI_Send(c, fc / np * cc, MPI_DOUBLE, 0, 30, MPI_COMM_WORLD);
}

void initialize(double *m, int f, int c, int ld) {
  int i, j;

  for (i = 0; i < f; i++) {
    for (j = 0; j < c; j++) {  
      m[i * ld + j] = (double)(i + j);
    }
  }
}

void initializealea(double *m, int f, int c, int ld) {
  int i, j;

  for (i = 0; i < f; i++) {
    for (j = 0; j < c; j++) {  
      m[i * ld + j] = (double)rand() / RAND_MAX;
    }
  }
}

void escribir(double *m, int f, int c, int ld) {
  int i, j;

  for (i = 0; i < f; i++) {
    for (j = 0; j < c; j++) {  
      printf("%.4lf ",m[i * ld + j]);
    }
    printf("\n");
  }
}

void comparar(double *m1, int fm1, int cm1, int ldm1, double *m2, int fm2, int cm2, int ldm2)
{
  int i, j;

  for(i = 0; i < fm1; i++)
    for(j = 0; j < cm1; j++) {
      if(m1[i * ldm1 + j] != m2[i * ldm2 + j]) {
        printf("Difference at %d,%d: %.8lf , %.8lf\n", i, j, m1[i * ldm1 + j], m2[i * ldm2 + j]);
        return;
      }
    }
}

int main(int argc, char *argv[]) 
{
  int nodo, np, i, j, fa, fal, ca, lda, fb, cb, ldb, fc, fcl, cc, ldc, N;
  int long_name;
  double ti, tf;
  double *a, *b, *c, *c0;
  char    nombre_procesador[MPI_MAX_PROCESSOR_NAME];
  MPI_Status estado;
 
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  MPI_Comm_rank(MPI_COMM_WORLD, &nodo);
  MPI_Get_processor_name(nombre_procesador, &long_name);

  if (nodo == 0) {
    N = atoi(argv[1]);
  }

  MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
  
  fa = ca = lda = fb = cb = ldb = fc = cc = ldc = N;
  fal = N / np;
  fcl = N / np;
  if (nodo == 0) {
    a = (double *) malloc(sizeof(double) * fa * ca);
    b = (double *) malloc(sizeof(double) * fb * cb);
    c = (double *) malloc(sizeof(double) * fc * cc);
  } else {
    a = (double *) malloc(sizeof(double) * fal * ca);
    b = (double *) malloc(sizeof(double) * fb * cb);
    c = (double *) malloc(sizeof(double) * fcl * cc);
  }
  
  if (nodo == 0) {
    c0 = (double *) malloc(sizeof(double) * fc * cc);
    initialize(a, fa, ca, lda);
    initialize(b, fb, cb, ldb);

    mms(a, fa, ca, lda, b, fb, cb, ldb, c0, fc, cc, ldc);
  }

  MPI_Barrier(MPI_COMM_WORLD);

  ti = MPI_Wtime();

  mm(a, fa, ca, lda, b, fb, cb, ldb, c, fc, cc, ldc, nodo, np);

  MPI_Barrier(MPI_COMM_WORLD);
  tf = MPI_Wtime();
  if (nodo == 0) {
    printf("Process %d, %s, Time %.6lf\n", nodo, nombre_procesador, tf - ti);
    comparar(c, fc, cc, ldc, c0, fc, cc, ldc);
  }
  
  free(a);
  free(b);
  free(c);
  if (nodo == 0)
    free(c0);
  MPI_Finalize();
}

In [None]:
!mpicc -g mm-mpi.c -o mm-mpi

In [None]:
!mpirun -np 4 vtune -collect hotspots -r results ./mm-mpi 1600

In [None]:
!vtune -report gprof-cc -result-dir results.login2/ -format text -report-output output-mm-mpi.txt

In [None]:
!/home/murilo/gprof2dot/gprof2dot.py -f axe output-mm-mpi.txt | dot -Tpng -o output-mm-mpi.png

In [None]:
from IPython.display import display
from PIL import Image
path="output-mm-mpi.png"
display(Image.open(path))

### ⊗ PERF

The Linux `perf` tool (perf) performs performance analysis using hardware counters, here mainly those related to cache memories. The following simple matrix multiplication illustrates this:

In [None]:
%%writefile mm.c
#include <stdio.h>
#include <stdlib.h>

void initializeMatrix(int *A, int n)
{
  for(int i=0; i < n; i++)
    for(int j=0; j < n; j++)
      A[i*n+j] = rand()%(10-1)*1; 
}

int main(int argc, char **argv)
{
 int n = atoi(argv[1]);  
 int i,j,k;

 int  *A = (int *) malloc (sizeof(int)*n*n);
 int  *B = (int *) malloc (sizeof(int)*n*n);
 /* calloc zero-initializes C, which is accumulated with += below */
 int  *C = (int *) calloc (n*n, sizeof(int));

 initializeMatrix(A,n);
 initializeMatrix(B,n);

 for(i = 0; i < n;  i++)
    for(j = 0; j < n; j++) 
      for(k = 0; k < n;  k++) 
          C[ i * n + j ] += A[ i * n + k ] * B[ k * n + j ];

/*
 * TODO: Measure the performance with the loop (j, i, k)
 */
    
/*
 * TODO: Measure the performance with the loop (i, k, j)
 */

 return 0;
}

In [None]:
!gcc mm.c -o mm

In [None]:
!perf stat -d ./mm 1024
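On systems that expose them, `perf stat -d` reports raw counts for events such as `L1-dcache-loads` and `L1-dcache-load-misses`. When comparing several runs, a ratio is often easier to read than the raw counts. The sketch below shows that arithmetic; the helper `miss_rate` is hypothetical, and the two numbers are placeholders for illustration only, not measurements from this system.

```python
def miss_rate(misses, accesses):
    """Return the fraction of accesses that missed, or 0.0 if there were none."""
    return misses / accesses if accesses else 0.0


# Placeholder values for illustration only, NOT measured results.
# Replace them with the counts printed by `perf stat -d`.
l1_loads = 1_000_000
l1_load_misses = 250_000

print(f"L1 data-cache miss rate: {miss_rate(l1_load_misses, l1_loads):.1%}")
```

A lower miss rate for the same amount of work usually means the loop touches memory in a more cache-friendly order.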

After profiling the application with the loop order (i, j, k), run two new experiments:

- Loop (j, i, k);
- Loop (i, k, j);

Then answer the following questions, comparing the new results with the profiling output above:

### ☆ Questions:

- Did the change in loop order affect performance? If so, what explains the difference?
- How does optimization relate to the concept of memory locality?
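One way to reason about these questions, sketched here under the assumption of row-major storage as used in `mm.c`, is to look at how far each array access moves in memory when the innermost loop variable advances by one. The helper `inner_strides` is hypothetical and only encodes the index expressions `C[i*n+j]`, `A[i*n+k]`, and `B[k*n+j]` from the code above.

```python
def inner_strides(order, n):
    """Return the element stride of the C, A and B accesses when the
    innermost loop variable of `order` advances by one (row-major n x n)."""
    inner = order[-1]
    # Coefficient of each loop variable in the index expressions of mm.c:
    # C[i*n + j], A[i*n + k], B[k*n + j].
    coeffs = {
        "C": {"i": n, "j": 1, "k": 0},
        "A": {"i": n, "j": 0, "k": 1},
        "B": {"i": 0, "j": 1, "k": n},
    }
    return {name: c[inner] for name, c in coeffs.items()}


n = 1024
for order in ("ijk", "jik", "ikj"):
    print(order, inner_strides(order, n))
```

A stride of 0 means the same element is reused, 1 means consecutive elements are read, and `n` means every access jumps a full row ahead. Comparing these strides with the cache counters reported by `perf` is a useful starting point for the answers.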

## Profiling GPU cores

The GPU has many units working in parallel, and it is common for an application to be bound by different units at different stages of its execution. Because of this, it helps to understand what a GPU bottleneck is and to identify where the GPU time is going when searching for one. Several tools help developers find bottlenecks when optimizing performance; the following sections introduce some of the NVIDIA profiling tools.

In [None]:
%%writefile vector-add.cu
#include <stdio.h>
#include <cuda.h>

void initWith(float num, float *a, int N)
{
  for(int i = 0; i < N; ++i)
    a[i] = num;
  
}

__global__ 
void addVectorsInto(float *result, float *a, float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
    result[i] = a[i] + b[i];
}

void checkElementsAre(float target, float *vector, int N)
{
  for(int i = 0; i < N; i++)
  {
    if(vector[i] != target)
    {
      printf("FAIL: vector[%d] - %0.0f does not equal %0.0f\n", i, vector[i], target);
      exit(1);
    }
  }
  printf("Success! All values calculated correctly.\n");
}

int main(int argc, char **argv)
{
  const int N = 2<<24;
  size_t size = N * sizeof(float);

  float *a;
  float *b;
  float *c;

  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  initWith(3, a, N);
  initWith(4, b, N);
  initWith(0, c, N);

  size_t threadsPerBlock;
  size_t numberOfBlocks;

  int deviceId;
  cudaGetDevice(&deviceId);

  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);
  int multiProcessorCount = props.multiProcessorCount;
  threadsPerBlock = 1024;
  numberOfBlocks = 32 * multiProcessorCount;
  
  cudaError_t addVectorsErr;
  cudaError_t asyncErr;

  addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);

  addVectorsErr = cudaGetLastError();
  if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));

  asyncErr = cudaDeviceSynchronize();
  if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));

  checkElementsAre(7, c, N);

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
}

In [None]:
!nvcc vector-add.cu -o vector-add

### ⊗ NSYS

`NVIDIA Nsight Systems` (nsys) is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of GPUs.

The command `nsys profile` generates a report file (`.qdrep`, or `.nsys-rep` in more recent Nsight Systems releases) that can be used in a variety of ways. We use the `--stats=true` flag here to request printed summary statistics. Quite a lot of information is printed:

- Profile configuration details
- Report file(s) generation details
- **CUDA API Statistics**
- **CUDA Kernel Statistics**
- **CUDA Memory Operation Statistics (time and size)**
- OS Runtime API Statistics

In this notebook you will primarily use the `nsys` command line. In the next one, you will load the generated report files into the Nsight Systems GUI for visual profiling.

In [None]:
!nsys profile --stats=true ./vector-add

After profiling the application, answer the following questions using the information displayed in the `CUDA Kernel Statistics` section of the profiling output.

### ☆ Questions:

- What was the name of the only CUDA kernel called in this application?
- How many times did this kernel run?
- How long did it take this kernel to run? Record this time somewhere: you will be optimizing this application and will want to know how much faster you can make it.

### ⊗ NCU

`NVIDIA Nsight Compute` (ncu) provides a non-interactive way to profile applications from the command line. It can print the results directly on the command line or store them in a report file. 

To print profiling information on the command line with `ncu`, do not specify an output file (the `-o` flag). If you want to generate an output file (`-o`) and still see the results on the command line, you can use the `--page` flag.

In [None]:
!ncu --set detailed ./vector-add

### ☆ Questions:

After profiling the application with ncu, answer the following question:

- What is the main difference between nsys and ncu?

## Clear the Temporary Files

Before moving on, please execute the following cell to clean up the directory. This is required before moving on to the next notebook.

In [1]:
!rm -rf dir_results results.login2 mm-mpi* output.png outputLU.png airis.png r000hs gmon.out mm* report1* test_* vector-* ../Documents ../intel *.txt dir_results.*

## Next

Please continue to the next notebook: [_Visual-Performance-Analysis-Tool_](02-Visual-Performance-Analysis-Tool.ipynb).