# Performance Analysis Profiling on CPU and GPU cores

Welcome to the webinar _Performance Analysis Profiling on CPU and GPU cores_. In this webinar you will learn several techniques for profiling single-CPU and single-GPU applications, with an emphasis on supercomputing environments.

## The Coding Environment

The first step is to display information about the CPU architecture with the `lscpu` command:

In [None]:
!lscpu

On this node, each GPU is attached to one of the NUMA nodes listed by `lscpu`.

For your work today, you have access to several GPUs in the cloud. Run the following cell to see the GPUs available to you today.

In [None]:
!nvidia-smi topo -m 

While your work today will be on a single node, all the techniques you learn today, in particular CUDA-aware MPI and NVSHMEM, can be used to run your applications across clusters of multi-GPU nodes.

Next, show the NVLink status of GPU 0 as reported by `nvidia-smi`:

In [None]:
!nvidia-smi nvlink --status -i 0

Finally, the `lstopo` command shows the topology of the system, including the NUMA memory nodes.

In [None]:
!lstopo --of png > ogbon.png

This will import and display a .png image in Jupyter:

In [None]:
from IPython.display import display
from PIL import Image
path="ogbon.png"
display(Image.open(path))

## Environment Modules on OGBON

```text
Currently Loaded Modulefiles:
    1) anaconda3/2022.05 
    2) cuda/11.6         
    3) ucx/1.12.0-cuda-11.6-ofed-5.4
    4) gcc/11.1.0  
    5) openmpi/4.1.1-cuda-11.6-ofed-5.4
    6) intel/parallel-studio-xe/2020.2        
```

## Profiling CPU cores

When profiling sequential algorithms in supercomputing environments, we need to measure which points in the code account for the highest computational cost of the application, so that we can focus our parallelization efforts on those sections. In this way, we work where the code actually stands to gain performance.

In [None]:
%%writefile test_cpucores.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void func1(void)
{    
    int n = 1000;  
    int i,j,k;

    int  *A = (int *) malloc (sizeof(int)*n*n);
    int  *B = (int *) malloc (sizeof(int)*n*n);
    int  *C = (int *) calloc (n*n, sizeof(int));  /* zero-initialized: C is accumulated with += */

    for(i=0; i < n; i++){
      for(j=0; j < n; j++){
        A[i*n+j] = rand()%(10-1)*1;
        B[i*n+j] = rand()%(10-1)*1;
      }
    }

    for(i = 0; i < n; i++) 
     for(j = 0; j < n; j++)
       for( k = 0; k < n; k++) 
        C[i*n+j]+=A[i*n+k]*B[k*n+j]; 

    free(A);
    free(B);
    free(C);
    return;
}

void func2(void)
{
    double h, x, s = 0, result;
    int a = 0, b = 1;
    int n = 1000000;
    int i;

    h = (double)(b - a) / n;  /* cast avoids integer division, which would make h = 0 */

    for(i = 0; i < n; i++) 
    {
       x = (a + h * (i + 0.5));
       s += 100 * x + sin(2 * x * 3.14159);
    }

    result = h * s;
    
    return;
}

int main(int argc, char **argv)
{
    printf("Inside main()\n");

    printf("Inside func1()\n");
    func1();
    
    printf("Inside func2()\n");
    for(int i = 0; i < 500; i++)
      func2();

    return 0;
}

### GPROF

This section uses the GNU profiler (gprof), which records the time spent in each function during execution and produces a performance report. To profile, compile the sequential code with the _-pg_ flag, run the program as usual to generate the binary profile data file (`gmon.out`), and then display the data in readable form with the `gprof` command, as follows:

In [None]:
!gcc test_cpucores.c -o test_gprof -lm -pg 

In [None]:
!./test_gprof

In [None]:
!gprof -b test_gprof gmon.out
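
Once the flat profile shows what share of the run time each function takes, Amdahl's law gives a rough upper bound on the gain from parallelizing a hotspot. The sketch below is illustrative only: the function name `amdahl_speedup` and the 80% hotspot share are assumptions, not values measured on this system.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on overall speedup when `parallel_fraction` of the
    run time is spread perfectly over `workers` and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical hotspot share; replace it with the "% time" value
# that gprof reports for the function you plan to parallelize.
hotspot_share = 0.80

for workers in (2, 8, 64):
    bound = amdahl_speedup(hotspot_share, workers)
    print(f"{workers:3d} workers -> at most {bound:.2f}x faster")
```

The bound flattens quickly as the number of workers grows, which is why the profile should guide where the parallelization effort goes.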

### VTUNE

The Intel(R) VTune(TM) Profiler command-line tool (`vtune`) performs a hotspots collection, based on user-mode sampling, on the given target.

In [None]:
!gcc test_cpucores.c -o test_vtune -lm

In [None]:
!vtune -collect hotspots ./test_vtune

### PERF

The perf tool (`perf`) performs performance analysis using hardware and software counters, including counters that describe cache memory behavior. The following simple matrix multiplication illustrates this:

In [None]:
%%writefile mm.c
#include <stdio.h>
#include <stdlib.h>

void initializeMatrix(int *A, int n){

  for(int i=0; i < n; i++)
    for(int j=0; j < n; j++)
      A[i*n+j] = rand()%(10-1)*1;
  
}

int main(int argc, char **argv)
{
 if (argc < 2) { fprintf(stderr, "Usage: %s <n>\n", argv[0]); return 1; }
 int n = atoi(argv[1]);  
 int i,j,k;

 int  *A = (int *) malloc (sizeof(int)*n*n);
 int  *B = (int *) malloc (sizeof(int)*n*n);
 int  *C = (int *) calloc (n*n, sizeof(int));  /* zero-initialized: C is accumulated with += */

 initializeMatrix(A,n);
 initializeMatrix(B,n);

 for(j = 0; j < n;  j++)
    for(i = 0; i < n; i++) 
      for(k = 0; k < n;  k++) 
          C[ i * n + j ] += A[ i * n + k ] * B[ k * n + j ];

/*
 * TODO: Measure the performance with the loop (i, j, k)
 */
    
/*
 * TODO: Measure the performance with the loop (i, k, j)
 */

 return 0;
}

In [None]:
!gcc mm.c -o mm

In [None]:
!perf stat -d ./mm 1024

After profiling the application with the loop order j, i, k used in the code, run two new experiments:

- Loop i, j, k;
- Loop i, k, j;

Then answer the following questions using the `perf stat` output of all three runs:

- Did the loop order change the performance? If so, why?
- How does this optimization relate to the concept of memory locality?
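
One way to reason about the questions above before reading the counters is to look at how far apart consecutive accesses are in the innermost loop. The sketch below assumes the row-major indexing used in `mm.c` (`C[i*n+j]`, `A[i*n+k]`, `B[k*n+j]`). The helper name `innermost_strides` is hypothetical, and strides are counted in elements, not bytes.

```python
def index_coefficients(n: int) -> dict:
    """Coefficient of each loop variable in the flattened index of each
    array, taken from C[i*n+j], A[i*n+k] and B[k*n+j]."""
    return {
        "C": {"i": n, "j": 1, "k": 0},
        "A": {"i": n, "j": 0, "k": 1},
        "B": {"i": 0, "j": 1, "k": n},
    }


def innermost_strides(order: str, n: int) -> dict:
    """Distance in elements between consecutive accesses to each array
    when the last variable in `order` is the innermost loop."""
    inner = order[-1]
    return {name: coeffs[inner] for name, coeffs in index_coefficients(n).items()}


for order in ("jik", "ijk", "ikj"):
    print(order, innermost_strides(order, 1024))
```

A stride of 0 means the same element is reused on every iteration, a stride of 1 means neighboring elements are read in sequence, and a stride of `n` means each access jumps a whole row ahead. Comparing these values with the cache counters from `perf stat -d` helps connect the loop order to memory locality.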

## Profiling GPU cores

A GPU has many units working in parallel, and an application is commonly bound by different units at different stages of its execution. It is therefore useful to identify where the GPU time is going when searching for bottlenecks, and to understand what a GPU bottleneck is. The NVIDIA profiling tools below help developers identify bottlenecks, which is useful when optimizing performance.

In [None]:
%%writefile vector-add.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

void initWith(float num, float *a, int N)
{
  for(int i = 0; i < N; ++i)
  {
    a[i] = num;
  }
}

__global__ 
void addVectorsInto(float *result, float *a, float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
  {
    result[i] = a[i] + b[i];
  }
}

void checkElementsAre(float target, float *vector, int N)
{
  for(int i = 0; i < N; i++)
  {
    if(vector[i] != target)
    {
      printf("FAIL: vector[%d] - %0.0f does not equal %0.0f\n", i, vector[i], target);
      exit(1);
    }
  }
  printf("Success! All values calculated correctly.\n");
}

int main(int argc, char **argv)
{
  const int N = 2<<24;
  size_t size = N * sizeof(float);

  float *a;
  float *b;
  float *c;

  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  initWith(3, a, N);
  initWith(4, b, N);
  initWith(0, c, N);

  size_t threadsPerBlock;
  size_t numberOfBlocks;

  int deviceId;
  cudaGetDevice(&deviceId);

  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);
  int multiProcessorCount = props.multiProcessorCount;
  threadsPerBlock = 1024;
  numberOfBlocks = 32 * multiProcessorCount;
  
  cudaError_t addVectorsErr;
  cudaError_t asyncErr;

  addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);

  addVectorsErr = cudaGetLastError();
  if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));

  asyncErr = cudaDeviceSynchronize();
  if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));

  checkElementsAre(7, c, N);

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
}

In [None]:
!nvcc vector-add.cu -o vector-add

### NSYS

NVIDIA Nsight Systems (nsys) is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of GPUs.

The command `nsys profile` generates a report file (`.qdrep`, or `.nsys-rep` in more recent Nsight Systems releases) that can be used in a variety of ways. We use the `--stats=true` flag here to request that summary statistics be printed. Quite a lot of information is printed:

- Profile configuration details
- Report file(s) generation details
- **CUDA API Statistics**
- **CUDA Kernel Statistics**
- **CUDA Memory Operation Statistics (time and size)**
- OS Runtime API Statistics

In this notebook you will primarily use the `nsys` command line. In the next notebook, you will load the generated report files into the Nsight Systems GUI for visual profiling.

In [None]:
!nsys profile --stats=true ./vector-add

After profiling the application, answer the following questions using information displayed in the `CUDA Kernel Statistics` section of the profiling output:

- What was the name of the only CUDA kernel called in this application?
- How many times did this kernel run?
- How long did it take this kernel to run? Record this time somewhere: you will be optimizing this application and will want to know how much faster you can make it.
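
The recorded kernel time becomes easier to compare across optimizations when it is converted into an effective memory bandwidth. The sketch below assumes the kernel reads two arrays and writes one, as `addVectorsInto` does. The function name `effective_bandwidth_gb_s` and the 50 ms kernel time are hypothetical placeholders, not measurements.

```python
def effective_bandwidth_gb_s(num_elements: int, bytes_per_element: int,
                             arrays_touched: int, kernel_seconds: float) -> float:
    """Bytes moved by the kernel divided by its run time, in GB/s (1 GB = 1e9 bytes)."""
    if kernel_seconds <= 0:
        raise ValueError("kernel_seconds must be positive")
    total_bytes = num_elements * bytes_per_element * arrays_touched
    return total_bytes / kernel_seconds / 1e9


N = 2 << 24        # same element count as in vector-add.cu
FLOAT_BYTES = 4    # sizeof(float)
ARRAYS = 3         # a and b are read, result is written

# Hypothetical placeholder: substitute the time nsys reports for the kernel.
kernel_ms = 50.0

bandwidth = effective_bandwidth_gb_s(N, FLOAT_BYTES, ARRAYS, kernel_ms / 1e3)
print(f"Effective bandwidth: {bandwidth:.2f} GB/s")
```

Because the arrays are allocated with `cudaMallocManaged` and initialized on the CPU, the measured kernel time may include page migration to the GPU, so the result can sit well below the device's peak memory bandwidth.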

### NCU

NVIDIA Nsight Compute CLI (ncu) provides a non-interactive way to profile applications from the command line. It can print the results directly on the command line or store them in a report file. 

To print profiling information on the command line, run `ncu` without specifying an output file (the `-o` flag). To generate an output file with `-o` and still see the results on the command line, add the `--page` flag.

In [None]:
!ncu --set detailed ./vector-add

### Question:

After profiling the application with ncu, answer the following question:

- What is the main difference between nsys and ncu?

## Next

Please continue to the next notebook: [_Visual-Performance-Analysis-Tool_](02-Visual-Performance-Analysis-Tool.ipynb).