# Hands-on: High Performance Computing applied to Industry

Welcome to _Hands-on_. In this short course you will learn several techniques for scaling computation in industrial applications, with an emphasis on OpenMP, OpenACC, and CUDA, which allow application codes to be parallelized elegantly and have been shown to scale well on supercomputing systems.

## The Coding Environment

For your work today, you have access to several computational resources in the cloud. Run the following cell to list the GPUs available to you.

In [1]:
!nvidia-smi

Fri Mar 24 11:49:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  On   | 00000000:60:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   42C    P0    46W / 300W |      0MiB / 32768MiB |      0%      Default |
|       

In [2]:
!nvidia-smi topo -m 

	GPU0	GPU1	GPU2	GPU3	mlx5_0	mlx5_1	CPU Affinity	NUMA Affinity
GPU0	 X 	NV2	NV2	NV2	PIX	SYS	0-17,36-53	0
GPU1	NV2	 X 	NV2	NV2	PIX	SYS	0-17,36-53	0
GPU2	NV2	NV2	 X 	NV2	SYS	PIX	18-35,54-71	1
GPU3	NV2	NV2	NV2	 X 	SYS	PIX	18-35,54-71	1
mlx5_0	PIX	PIX	SYS	SYS	 X 	SYS		
mlx5_1	SYS	SYS	PIX	PIX	SYS	 X 		

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks


While your work today will be on a single node, all the techniques you learn can be used to run your applications across clusters of multi-GPU nodes.

## Environment Modules in the Supercomputing Environment

These modules must be loaded before starting the Jupyter notebook:
```text
Currently Loaded Modulefiles:
    1) anaconda3/2022.05 
    2) cuda/11.6         
    3) ucx/1.12.0-cuda-11.6-ofed-5.4
    4) gcc/11.1.0  
    5) openmpi/4.1.1-cuda-11.6-ofed-5.4
    6) intel/parallel-studio-xe/2020.2    
    7) pgi/2019.2
    8) llvm/11.0.0
```

## Table of Contents

During the workshop today you will work through each of the following notebooks with your instructor:

- [Accelerate a Thermal Conductivity Application](2-heat.ipynb): You will begin by familiarizing yourself with a single-GPU implementation of a thermal conductivity application, which we will use to introduce multi-resource programming paradigms.
- [Seismic Modelling - 1D Wave Equation](3-wave.ipynb): You will apply what you have learned by refactoring a single-GPU 1D wave equation solver to run in a supercomputing environment.
- [Final Exercise](4-finalExercise.ipynb): In this exercise you will put the concepts from the previous notebooks into practice.

## Matrix Multiplication Benchmark

### A Single CPU Implementation 

In [4]:
%%writefile mm.c
#include <stdio.h>
#include <stdlib.h>

void fill_matrix(double *A, int n){
 
  for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++)
      A[i*n+j] = rand()%(10-1)*1;
  
}

void print_matrix(double *A, int n){

  for(int i = 0; i < n; i++){
    for(int j = 0; j < n; j++)
      printf("%1.2f\t", A[i*n+j]);
    printf("\n");
  }

  printf("\n");

}

int main(int argc, char **argv){

 int n = atoi(argv[1]);  
 int i, j, k;

 double  *A = (double *) malloc (sizeof(double) * n * n);
 double  *B = (double *) malloc (sizeof(double) * n * n);
 double  *C = (double *) calloc (n * n, sizeof(double));  // zero-initialized: the loop below accumulates into C

 fill_matrix(A,n);
 fill_matrix(B,n);

 for(i = 0; i < n; i++) 
  for(j = 0; j < n; j++)
    for(k = 0; k < n; k++) 
      C[i*n+j]+=A[i*n+k]*B[k*n+j];

 print_matrix(A,n);
 print_matrix(B,n);
 print_matrix(C,n);

 return 0;

}

Writing mm.c


In [5]:
!gcc mm.c -o mm

In [6]:
!./mm 12

1.00	7.00	0.00	7.00	5.00	7.00	1.00	3.00	6.00	1.00	5.00	4.00	
5.00	7.00	5.00	4.00	6.00	0.00	7.00	1.00	8.00	8.00	6.00	6.00	
8.00	8.00	8.00	4.00	1.00	1.00	5.00	0.00	0.00	3.00	5.00	3.00	
1.00	7.00	4.00	7.00	6.00	0.00	0.00	2.00	5.00	4.00	5.00	2.00	
2.00	3.00	2.00	1.00	1.00	8.00	8.00	0.00	5.00	5.00	4.00	4.00	
6.00	0.00	5.00	6.00	2.00	8.00	7.00	3.00	4.00	2.00	0.00	0.00	
0.00	0.00	2.00	6.00	2.00	5.00	6.00	5.00	7.00	6.00	6.00	8.00	
5.00	3.00	6.00	2.00	8.00	1.00	6.00	6.00	8.00	0.00	1.00	1.00	
7.00	0.00	3.00	2.00	0.00	1.00	2.00	1.00	8.00	3.00	5.00	2.00	
6.00	0.00	7.00	2.00	7.00	2.00	8.00	1.00	6.00	5.00	1.00	5.00	
4.00	6.00	0.00	4.00	6.00	2.00	3.00	2.00	0.00	4.00	3.00	7.00	
5.00	3.00	6.00	5.00	4.00	2.00	5.00	2.00	1.00	3.00	2.00	8.00	

3.00	2.00	0.00	0.00	7.00	2.00	4.00	3.00	6.00	2.00	5.00	1.00	
2.00	6.00	4.00	2.00	2.00	7.00	8.00	5.00	1.00	5.00	1.00	4.00	
8.00	4.00	6.00	7.00	5.00	8.00	6.00	0.00	8.00	4.00	0.00	7.00	
4.00	2.00	8.00	1.00	5.00	2.00	3.00	7.00	8.00	7.00	8.00	1.00	
3.00	7.00	5.00	2.00	3.0

#### How to measure execution time in C code?

In [13]:
%%writefile mm-time.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

void fill_matrix(double *A, int n){ 
  for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++)
      A[i*n+j] = rand()%(10-1)*1;
  
}

void print_matrix(double *A, int n){

  for(int i = 0; i < n; i++){
    for(int j = 0; j < n; j++)
      printf("%1.2f\t", A[i*n+j]);
    printf("\n");
  }

  printf("\n");

}

int main(int argc, char **argv)
{
  int n = atoi(argv[1]);  
  int i, j, k;
  struct timeval begin, end;  

  double  *A = (double *) malloc (sizeof(double) * n * n);
  double  *B = (double *) malloc (sizeof(double) * n * n);
  double  *C = (double *) calloc (n * n, sizeof(double));  // zero-initialized: the loop below accumulates into C

  fill_matrix(A, n);
  fill_matrix(B, n);

  // Start measuring time
  gettimeofday(&begin, 0);
    
  for(i = 0; i < n; i++) 
   for(j = 0; j < n; j++)
     for(k = 0; k < n; k++) 
       C[i*n+j] += A[i*n+k] * B[k*n+j];
  
  // End measuring time
  gettimeofday(&end, 0);
    
  long seconds = end.tv_sec - begin.tv_sec;
  long microseconds = end.tv_usec - begin.tv_usec;
  double elapsed = seconds + microseconds * 1e-6;
    
  printf("%d x %d  %.2f seconds\n", n, n, elapsed);  

  //print_matrix(A,n);
  //print_matrix(B,n);
  //print_matrix(C,n);

  return 0;
}

Overwriting mm-time.c


In [14]:
!gcc mm-time.c -o mm-time

In [15]:
!./mm-time 1000

1000 x 1000  4.55 seconds


#### How to profile this code?

##### GPROF

In [16]:
!gcc mm-time.c -o mm-profilling-gprof -pg

In [17]:
!./mm-profilling-gprof 1000

1000 x 1000  4.36 seconds


In [18]:
!gprof -b mm-profilling-gprof gmon.out

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
100.41      4.39     4.39                             main
  0.23      4.40     0.01        2     5.03     5.03  fill_matrix

			Call graph


granularity: each sample hit covers 2 byte(s) for 0.23% of 4.40 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    4.39    0.01                 main [1]
                0.01    0.00       2/2           fill_matrix [2]
-----------------------------------------------
                0.01    0.00       2/2           main [1]
[2]      0.2    0.01    0.00       2         fill_matrix [2]
-----------------------------------------------

Index by function name

   [2] fill_matrix             [1] main


##### VTUNE

In [19]:
!gcc mm-time.c -o mm-profilling-vtune

In [20]:
!vtune -collect hotspots ./mm-profilling-vtune 1000

vtune: Collection started. To stop the collection, either press CTRL-C or enter from another console window: vtune -r /home/murilo/parallel-computing-applied/r000hs -command stop.
1000 x 1000  4.51 seconds
vtune: Collection stopped.
vtune: Using result path `/home/murilo/parallel-computing-applied/r000hs'
vtune: Executing actions 19 % Resolving information for `libc.so.6'            
vtune: Executing actions 75 % Generating a report
Elapsed Time: 4.632s
    CPU Time: 4.510s
        Effective Time: 4.510s
            Idle: 0s
            Poor: 4.510s
            Ok: 0s
            Ideal: 0s
            Over: 0s
        Spin Time: 0s
        Overhead Time: 0s
    Total Thread Count: 1
    Paused Time: 0s

Top Hotspots
Function     Module               CPU Time
-----------  -------------------  --------
main         mm-profilling-vtune    4.500s
fill_matrix  mm-profilling-vtune    0.010s
Effective Physical Core Utilization: 2.7% (0.977 out of 36)
 | The metric

### OpenMP

In [21]:
%%writefile mm-omp.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>

void fill_matrix(double *A, int n)
{
  for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++)
      A[i*n+j] = rand()%(10-1)*1;
}

void print_matrix(double *A, int n)
{
  for(int i = 0; i < n; i++){
    for(int j = 0; j < n; j++)
      printf("%1.2f\t", A[i*n+j]);
   printf("\n");
  }

  printf("\n");
}

int main(int argc, char **argv)
{
  int n = atoi(argv[1]);  
  int i, j, k;
  struct timeval begin, end;
  
  double  *A = (double *) malloc(sizeof(double) * n * n);
  double  *B = (double *) malloc(sizeof(double) * n * n);
  double  *C = (double *) calloc(n * n, sizeof(double));  // zero-initialized: the loop below accumulates into C

  fill_matrix(A,n);
  fill_matrix(B,n);

  gettimeofday(&begin, 0);
     
  #pragma omp parallel for private(i,j,k)
   for(i = 0; i < n; i++) 
    for(j = 0; j < n; j++)
      for(k = 0; k < n; k++) 
        C[i*n+j] += A[i*n+k] * B[k*n+j];
    
   gettimeofday(&end, 0);
  
   long seconds = end.tv_sec - begin.tv_sec;
   long microseconds = end.tv_usec - begin.tv_usec;
   double elapsed = seconds + microseconds*1e-6;
    
   printf("%d x %d  %.2f seconds\n", n, n, elapsed);    
    
   //print_matrix(A,n);
   //print_matrix(B,n);
   //print_matrix(C,n);

   return 0;
}

Writing mm-omp.c


In [22]:
!gcc mm-omp.c -o mm-omp -fopenmp -O3

In [29]:
!OMP_NUM_THREADS=6 ./mm-omp 1000

1000 x 1000  0.21 seconds


### CUDA

In [30]:
%%writefile mm-CUDA.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <sys/time.h>

__global__ void kernel(double *A, double *B, double *C, int n) 
{  
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;

  if(i < n && j < n){
    double sum = 0.0;   // accumulate locally: cudaMalloc does not zero C
    for(int k = 0; k < n; k++)
      sum += A[i*n+k] * B[k*n+j];
    C[i*n+j] = sum;
  }

}
 
void mult_matrix_cpu(double *A, double *B, double *C, int n) 
{
   for(int i = 0; i < n; i++) 
      for(int j = 0; j < n; j++)
         for(int k = 0; k < n; k++) 
            C[i*n+j]+=A[i*n+k]*B[k*n+j];
        
}

void fill_matrix(double *A, int n)
{ 
   for(int i=0; i < n; i++)
     for(int j=0; j < n; j++)
       A[i*n+j] = rand()%(10-1)*1;
   
}

void print_matrix(double *A, int n)
{
  for(int i = 0; i < n; i++){
    for(int j = 0; j < n; j++)
      printf("%1.2f\t", A[i*n+j]);
    printf("\n");
  }

  printf("\n");

}

int main(int argc, char **argv)
{
    int n = atoi(argv[1]);
    int sizeblock = atoi(argv[2]);
    struct timeval begin, end;

    /*Host*/
    double *A_host=(double *) malloc (n * n * sizeof(double));
    double *B_host=(double *) malloc (n * n * sizeof(double));
    double *C_host=(double *) malloc (n * n * sizeof(double));
        
    fill_matrix(A_host,n);
    fill_matrix(B_host,n);
      
    //print_matrix(A_host,n);
    //print_matrix(B_host,n);

    gettimeofday(&begin, 0);
    
    /*Device*/
    double *A_device;
    double *B_device;
    double *C_device;

    cudaMalloc((void**)&A_device, n * n * sizeof(double) ); 
    cudaMalloc((void**)&B_device, n * n * sizeof(double) ); 
    cudaMalloc((void**)&C_device, n * n * sizeof(double) ); 

    cudaMemcpy(A_device, A_host, n * n * sizeof(double), cudaMemcpyHostToDevice ); 
    cudaMemcpy(B_device, B_host, n * n * sizeof(double), cudaMemcpyHostToDevice ); 

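    /*Computational GRID: (Grid: 2D Block: 2D)*/
    int blocks_per_dim = (n + sizeblock - 1) / sizeblock;  /* integer ceiling of n / sizeblock */
    dim3 NUMBER_OF_BLOCKS (blocks_per_dim, blocks_per_dim);
    dim3 NUMBER_OF_THREADS(sizeblock, sizeblock);  /* sizeblock * sizeblock must not exceed 1024 */

    kernel<<< NUMBER_OF_BLOCKS, NUMBER_OF_THREADS >>>(A_device, B_device, C_device, n);

    /* Report launch failures, e.g. an invalid block size */
    cudaError_t err = cudaGetLastError();
    if(err != cudaSuccess)
      fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
    cudaDeviceSynchronize();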

    cudaMemcpy(C_host, C_device, n * n * sizeof(double), cudaMemcpyDeviceToHost ); 

    //print_matrix(C_host, n );

    gettimeofday(&end, 0);
    
    long seconds = end.tv_sec - begin.tv_sec;
    long microseconds = end.tv_usec - begin.tv_usec;
    double elapsed = seconds + microseconds*1e-6;
    
    printf("%d x %d  %.3f seconds\n", n, n, elapsed);  
    
    cudaFree(A_device );
    cudaFree(B_device );
    cudaFree(C_device );
  
    return 0;
}

Writing mm-CUDA.cu


In [31]:
!nvcc mm-CUDA.cu -o mm-CUDA

In [33]:
!./mm-CUDA 1000 64

1000 x 1000  0.197 seconds


### OpenACC

In [34]:
%%writefile mm-openacc.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

void fill_matrix(double *A, int n)
{
  for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++)
      A[i*n+j] = rand()%(10-1)*1; 
}

void print_matrix(double *A, int n)
{
  for(int i = 0; i < n; i++){
    for(int j = 0; j < n; j++)
      printf("%1.2f\t", A[i*n+j]);
    printf("\n");
  }

  printf("\n");

}

int main(int argc, char **argv)
{
  int n = atoi(argv[1]);  
  int i, j, k;
  struct timeval begin, end;
 
  double *A = (double *) malloc (sizeof(double) * n * n);
  double *B = (double *) malloc (sizeof(double) * n * n);
  double *C = (double *) calloc (n * n, sizeof(double));  // zeroed: the loop accumulates into C

  fill_matrix(A,n);
  fill_matrix(B,n);
 
  gettimeofday(&begin, 0);
      
  // copy (not copyout) sends the zeroed C to the device before accumulating into it
  #pragma acc data present_or_copyin(A[:n*n], B[:n*n], n) copy(C[:n*n])
   #pragma acc parallel 
    #pragma acc loop
     for(i = 0; i < n; i++) 
       for(j = 0; j < n; j++)
         for(k = 0; k < n; k++) 
           C[i*n+j] += A[i*n+k] * B[k*n+j];

   gettimeofday(&end, 0); 
  
   long seconds = end.tv_sec - begin.tv_sec;
   long microseconds = end.tv_usec - begin.tv_usec;
   double elapsed = seconds + microseconds*1e-6;
    
    printf("%d x %d  %.2f seconds\n", n, n, elapsed);  
     
  //print_matrix(A,n);
  //print_matrix(B,n);
  //print_matrix(C,n);

  return 0;
}

Writing mm-openacc.c


In [35]:
!pgcc mm-openacc.c -o mm-openacc -acc


Please obtain a new version at http://www.pgroup.com/community or
contact PGI Sales at sales@pgroup.com to obtain a perpetual license.



In [36]:
!./mm-openacc 1000

1000 x 1000  0.64 seconds


### OpenMP5

In [37]:
%%writefile mm-omp5.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>

void fill_matrix(double *A, int n)
{ 
  for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++)
      A[i*n+j] = rand()%(10-1)*1; 
}

void print_matrix(double *A, int n)
{
  for(int i = 0; i < n; i++){
    for(int j = 0; j < n; j++)
      printf("%1.2f\t", A[i*n+j]);
    printf("\n");
  }
  
  printf("\n");
}

int main(int argc, char **argv)
{
  int n = atoi(argv[1]);  
  int i, j, k;
  struct timeval begin, end;

  double  *A = (double *) malloc (sizeof(double) * n * n);
  double  *B = (double *) malloc (sizeof(double) * n * n);
  double  *C = (double *) calloc (n * n, sizeof(double));  // zeroed: the loop accumulates into C

  fill_matrix(A,n);
  fill_matrix(B,n);

  gettimeofday(&begin, 0);
    
  // tofrom (not from) copies the zeroed C to the device before accumulating into it
  #pragma omp target data map(to:A[:n*n], B[:n*n], n) map(tofrom:C[:n*n])
  {
   #pragma omp target teams distribute parallel for private(i,j,k)
   for(i = 0; i < n; i++) 
     for(j = 0; j < n; j++)
       for(k = 0; k < n; k++) 
         C[i*n+j] += A[i*n+k] * B[k*n+j];
  }

   gettimeofday(&end, 0); 
    
   long seconds = end.tv_sec - begin.tv_sec;
   long microseconds = end.tv_usec - begin.tv_usec;
   double elapsed = seconds + microseconds*1e-6;
    
    printf("%d x %d  %.2f seconds\n", n, n, elapsed);   
  
  //print_matrix(A,n);
  //print_matrix(B,n);
  //print_matrix(C,n);

  return 0;
}

Writing mm-omp5.c


In [38]:
!clang mm-omp5.c -o mm-omp5 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda

In [39]:
!./mm-omp5 1000

1000 x 1000  2.11 seconds


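### Performance Comparison

| Program Version      | Execution Time (s)     | Speedup      |
| :---                 |    :----:              |        ---:  |
| Serial               | 4.55                   | 1x           |
| OpenMP T=36          | 0.05                   | 91x          |
| OpenACC              | 0.64                   | 7x           |
| CUDA                 | 0.19                   | 24x          |
| OpenMP5              | 2.11                   | 2x           |

The OpenMP row reports a 36-thread run, not the 6-thread run (0.21 s) shown above. The serial baseline was compiled without optimization flags while the OpenMP version used `-O3`, so the speedups are indicative rather than a like-for-like comparison.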

## Next

Please continue to the next notebook: [Accelerate a Thermal Conductivity Application](2-heat.ipynb).