<div align="center"><h1> Accelerate a Thermal Conductivity Application </h1></div>

Thermal conductivity can be defined as the rate at which heat is transferred by conduction through a unit cross-sectional area of a material when a temperature gradient exists perpendicular to that area. We use a 2D heat conduction application, a stencil-based code, as the application to run on multicore and GPU environments.

![Thermal Conductivity](./images/heat-conduction-2.png)

In the following applied problem, you will accelerate an application that simulates the thermal conduction of silver in two-dimensional space.
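For reference, the update that `step_kernel` applies at every interior grid point can be written as the formula below. The symbols are read off the code in this notebook: $T^{n}_{i,j}$ stands for `temp_in` at column `i` and row `j`, $T^{n+1}_{i,j}$ for `temp_out` at the same point, and $f$ for the `fact` argument. The code carries no separate grid spacing or time step, so treating them as folded into $f$ is an assumption.

$$
T^{n+1}_{i,j} = T^{n}_{i,j} + f\left[\left(T^{n}_{i-1,j} - 2T^{n}_{i,j} + T^{n}_{i+1,j}\right) + \left(T^{n}_{i,j-1} - 2T^{n}_{i,j} + T^{n}_{i,j+1}\right)\right]
$$

The first bracketed group corresponds to `d2tdx2` and the second to `d2tdy2`.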

Convert the `step_kernel` function inside [`heat-sequential.cpp`] to execute on multicore CPUs and on the GPU, and modify the `main` function to properly allocate data for use on the CPU and GPU. Because this code involves floating-point calculations, different processors, or even simply reordering operations on the same processor, can produce slightly different results. For this reason the error checking code uses an error threshold instead of looking for an exact match.
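A threshold-based comparison of this kind could look like the sketch below. The listings in this notebook do not include the check itself, so the helper name `within_threshold` and the choice of an absolute (rather than relative) tolerance are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: returns true when every element of `result` lies
// within `threshold` of the matching element of `reference`.
bool within_threshold(const float* reference, const float* result,
                      std::size_t count, float threshold)
{
  for (std::size_t k = 0; k < count; ++k) {
    // Written as !(diff <= threshold) so that a NaN difference also fails.
    if (!(std::fabs(reference[k] - result[k]) <= threshold)) {
      return false;
    }
  }
  return true;
}
```

One way to use it would be to run the sequential kernel on `temp1_ref`/`temp2_ref` and the accelerated kernel on `temp1`/`temp2`, then compare the two final buffers over all `ni*nj` elements.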

## ⊗ Sequential

In [None]:
%%writefile heat-sequential.cpp
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>

// Simple define to index into a 1D array from 2D space
#define I2D(num, c, r) ((r)*(num)+(c))

void step_kernel(int ni, int nj, float fact, float* temp_in, float* temp_out)
{
  int i00, im10, ip10, i0m1, i0p1;
  float d2tdx2, d2tdy2;
  int i, j;

  // loop over all points in domain (except boundary)
  for ( j=1; j < nj-1; j++ ) {
    for ( i=1; i < ni-1; i++ ) {
      // find indices into linear memory
      // for central point and neighbours
      i00 = I2D(ni, i, j);
      im10 = I2D(ni, i-1, j);
      ip10 = I2D(ni, i+1, j);
      i0m1 = I2D(ni, i, j-1);
      i0p1 = I2D(ni, i, j+1);

      // evaluate derivatives
      d2tdx2 = temp_in[im10]-2*temp_in[i00]+temp_in[ip10];
      d2tdy2 = temp_in[i0m1]-2*temp_in[i00]+temp_in[i0p1];

      // update temperatures
      temp_out[i00] = temp_in[i00]+fact*(d2tdx2 + d2tdy2);
    }
  }
}

int main(int argc, char **argv)
{
  int istep;
  int nstep = 20000; // number of time steps

  // Specify our 2D dimensions
  const int ni = 1024;
  const int nj = 1024;
  float tfac = 8.418e-5; // thermal diffusivity of silver

  float *temp1_ref, *temp2_ref, *temp_tmp, *temp1, *temp2;

  const int size = ni * nj * sizeof(float);

  temp1_ref = (float*)malloc(size);
  temp2_ref = (float*)malloc(size);
  temp1     = (float*)malloc(size);
  temp2     = (float*)malloc(size);

  // Start measuring time
  struct timeval begin, end;
  gettimeofday(&begin, 0);

  // Initialize with random data
  for(int i = 0; i < ni*nj; ++i)
    temp1_ref[i] = temp2_ref[i] = temp1[i] = temp2[i] = (float)rand()/(float)(RAND_MAX/100.0f);

  // Execute the kernel version
  for(istep=0; istep < nstep; istep++) {
    step_kernel(ni, nj, tfac, temp1_ref, temp2_ref);

    // swap the temperature pointers
    temp_tmp = temp1_ref;
    temp1_ref = temp2_ref;
    temp2_ref= temp_tmp;
  }

  gettimeofday(&end, 0);
  long seconds = end.tv_sec - begin.tv_sec;
  long microseconds = end.tv_usec - begin.tv_usec;
  double elapsed = seconds + microseconds*1e-6;
    
  printf("%d x %d (%d) %.3f seconds\n", ni, nj, nstep, elapsed);

  free( temp1_ref );
  free( temp2_ref );
  free( temp1 );
  free( temp2 );

  return 0;
}

In [None]:
!g++ heat-sequential.cpp -o heat-sequential -O3

In [None]:
!./heat-sequential

## ⊗ OpenMP

In the first step we try to optimize this code for a multicore system. Here we insert an OpenMP `parallel for` directive before the nested loop in `step_kernel`. This directive tells the compiler to distribute the work among the threads, whose number is set in the `OMP_NUM_THREADS` environment variable. We use the commonly used host compiler GCC to compile this program. First we simply insert the following one-line directive:

~~~c++
#pragma omp parallel for private(i, j, i00, im10, ip10, i0m1, i0p1, d2tdx2, d2tdy2) shared(fact,temp_in,temp_out)
~~~

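The directive tells the compiler to run the following loop nest in parallel, dividing the iterations of the outer `j` loop among the CPU threads. The loop indices and temporaries are declared `private` so that each thread works on its own copies, while `fact`, `temp_in`, and `temp_out` are shared. The complete code follows: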

In [None]:
%%writefile heat-omp.cpp
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include <omp.h>

// Simple define to index into a 1D array from 2D space
#define I2D(num, c, r) ((r)*(num)+(c))

void step_kernel_OpenMP(int ni, int nj, float fact, float* temp_in, float* temp_out)
{
  int i00, im10, ip10, i0m1, i0p1;
  float d2tdx2, d2tdy2;
  int i, j;

  #pragma omp parallel for private(i, j, i00, im10, ip10, i0m1, i0p1, d2tdx2, d2tdy2) shared(fact,temp_in,temp_out)
  // loop over all points in domain (except boundary)
  for ( j=1; j < nj-1; j++ ) {
    for ( i=1; i < ni-1; i++ ) {
      // find indices into linear memory
      // for central point and neighbours
      i00 = I2D(ni, i, j);
      im10 = I2D(ni, i-1, j);
      ip10 = I2D(ni, i+1, j);
      i0m1 = I2D(ni, i, j-1);
      i0p1 = I2D(ni, i, j+1);

      // evaluate derivatives
      d2tdx2 = temp_in[im10]-2*temp_in[i00]+temp_in[ip10];
      d2tdy2 = temp_in[i0m1]-2*temp_in[i00]+temp_in[i0p1];

      // update temperatures
      temp_out[i00] = temp_in[i00]+fact*(d2tdx2 + d2tdy2);
    }
  }
}

int main(int argc, char **argv)
{
  int istep;
  int nstep = 20000; // number of time steps

  // Specify our 2D dimensions
  const int ni = 1024;
  const int nj = 1024;
  float tfac = 8.418e-5; // thermal diffusivity of silver

  float *temp1_ref, *temp2_ref, *temp_tmp, *temp1, *temp2;

  const int size = ni * nj * sizeof(float);

  temp1_ref = (float*)malloc(size);
  temp2_ref = (float*)malloc(size);
  temp1     = (float*)malloc(size);
  temp2     = (float*)malloc(size);

  // Start measuring time
  struct timeval begin, end;
  gettimeofday(&begin, 0);

  // Initialize with random data
  for (int i = 0; i < ni*nj; ++i) {
    temp1_ref[i] = temp2_ref[i] = temp1[i] = temp2[i] = (float)rand()/(float)(RAND_MAX/100.0f);
  }

  // Execute the kernel version
  for (istep=0; istep < nstep; istep++) {
    step_kernel_OpenMP(ni, nj, tfac, temp1_ref, temp2_ref);

    // swap the temperature pointers
    temp_tmp = temp1_ref;
    temp1_ref = temp2_ref;
    temp2_ref= temp_tmp;
  }

  gettimeofday(&end, 0);
  long seconds = end.tv_sec - begin.tv_sec;
  long microseconds = end.tv_usec - begin.tv_usec;
  double elapsed = seconds + microseconds*1e-6;
    
  printf("%d x %d (%d) %.3f seconds\n", ni, nj, nstep, elapsed);

  free( temp1_ref );
  free( temp2_ref );
  free( temp1);
  free( temp2);

  return 0;
}

In [None]:
!g++ heat-omp.cpp -o heat-omp -fopenmp -O3

In [None]:
!OMP_NUM_THREADS=4 ./heat-omp

The following table shows the performance and speedup of the OpenMP code compared to the serial code. The speedup stays below the ideal at higher thread counts (13.01X with 16 threads) because the kernel containing the OpenMP `parallel for` directive is called inside the main time-step loop, so the overhead of entering and exiting the parallel region is paid on every one of the 20000 steps.

| Program Version      | Execution Time (sec.)  | Speedup     |
| :---                 |    :----:              |        ---: |
| Serial               | 9.76                   | 1X           |
| OpenMP T=2           | 4.66                   | 2.09X        |
| OpenMP T=4           | 2.35                   | 4.15X        |
| OpenMP T=6           | 1.64                   | 5.95X        |
| OpenMP T=8           | 1.29                   | 7.56X        |   
| OpenMP T=16          | 0.75                   | 13.01X       |   
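One way to avoid re-entering the parallel region on every time step is to open it once around the whole time loop and use a worksharing `omp for` inside it. The sketch below illustrates that idea; it is not the code that was measured above, and the function name `run_steps_single_region` and its pointer-returning interface are assumptions made for illustration.

```cpp
#include <utility>

// Simple define to index into a 1D array from 2D space
#define I2D(num, c, r) ((r)*(num)+(c))

// Hypothetical variant: runs all nstep updates inside one parallel region.
// Returns the buffer that holds the final temperatures.
float* run_steps_single_region(int ni, int nj, float fact, int nstep,
                               float* a, float* b)
{
  #pragma omp parallel
  {
    // Declared inside the region, so every thread has its own copies
    // and swaps them in lockstep with the other threads.
    float* in = a;
    float* out = b;

    for (int step = 0; step < nstep; step++) {
      // Rows are divided among the threads already in the team.
      #pragma omp for
      for (int j = 1; j < nj-1; j++) {
        for (int i = 1; i < ni-1; i++) {
          int i00 = I2D(ni, i, j);
          float d2tdx2 = in[I2D(ni, i-1, j)] - 2*in[i00] + in[I2D(ni, i+1, j)];
          float d2tdy2 = in[I2D(ni, i, j-1)] - 2*in[i00] + in[I2D(ni, i, j+1)];
          out[i00] = in[i00] + fact*(d2tdx2 + d2tdy2);
        }
      }
      // The implicit barrier at the end of "omp for" guarantees that all
      // writes of this step are finished before any thread swaps and reads.
      std::swap(in, out);
    }
  }

  // After an even number of steps the result is back in the first buffer.
  return (nstep % 2 == 0) ? a : b;
}
```

Compiled without `-fopenmp` the pragmas are ignored and the function runs serially with the same result. Whether this variant is actually faster depends on the OpenMP runtime and the machine, so it should be timed rather than assumed.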

## ⊗ OpenACC

In this step we will see if we can improve the performance by using OpenACC directives. Here we show how to parallelize the code with the PGI compiler. First we simply insert the following one-line directive:

~~~c++
#pragma acc kernels copyin(temp_in[0:ni*nj]) copy(temp_out[0:ni*nj])
~~~

The directive tells the compiler to convert the following loop nest into a kernel that will be executed on the accelerator (in this case the GPU). We also specify that the data in `temp_in` should be copied to the device before the kernel execution but not copied back to the host afterwards. The data in `temp_out` needs to be both copied to the device before the kernel execution and copied back to the host after it. The reason that `temp_out` also needs to be copied to the device is that the kernel only updates the inner points, while `temp_out` also includes the boundary points. If we just used `copyout(temp_out)`, the boundary values transferred back to the host would be garbage.

In the complete code that follows, these transfers are moved out of the kernel: a `#pragma acc data` region in `main` wraps the time-step loop, and the kernel declares both arrays as `present`, so the data stays on the device across all time steps.
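As a minimal sketch of the one-line variant described above, the kernel could carry the `copyin` and `copy` clauses directly on the `kernels` directive. The function name `step_kernel_copy_clauses` is hypothetical, and the arithmetic is the same as in `step_kernel`.

```cpp
// Simple define to index into a 1D array from 2D space
#define I2D(num, c, r) ((r)*(num)+(c))

// Hypothetical name: same stencil as step_kernel, with the data clauses
// attached to the compute directive itself.
void step_kernel_copy_clauses(int ni, int nj, float fact,
                              float* temp_in, float* temp_out)
{
  // copyin: host -> device before the kernel, nothing copied back.
  // copy:   host -> device before the kernel, device -> host after it,
  //         so the untouched boundary values of temp_out survive.
  #pragma acc kernels copyin(temp_in[0:ni*nj]) copy(temp_out[0:ni*nj])
  #pragma acc loop collapse(2) independent
  for (int j = 1; j < nj-1; j++) {
    for (int i = 1; i < ni-1; i++) {
      int i00 = I2D(ni, i, j);
      float d2tdx2 = temp_in[I2D(ni, i-1, j)] - 2*temp_in[i00] + temp_in[I2D(ni, i+1, j)];
      float d2tdy2 = temp_in[I2D(ni, i, j-1)] - 2*temp_in[i00] + temp_in[I2D(ni, i, j+1)];
      temp_out[i00] = temp_in[i00] + fact*(d2tdx2 + d2tdy2);
    }
  }
}
```

With no enclosing data region, the clauses in this form move both arrays on every call, that is, once per time step. A compiler without OpenACC support ignores the pragmas and runs the loop on the host.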

In [None]:
%%writefile heat-acc.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>

// Simple define to index into a 1D array from 2D space
#define I2D(num, c, r) ((r)*(num)+(c))

void step_kernel_OpenACC(int ni, int nj, float fact, float* temp_in, float* temp_out)
{
  int i00, im10, ip10, i0m1, i0p1;
  float d2tdx2, d2tdy2;
  int i, j;

  #pragma acc kernels present(temp_in[0:ni*nj],temp_out[0:ni*nj])
  #pragma acc loop collapse(2) independent
  // loop over all points in domain (except boundary)
  for ( j=1; j < nj-1; j++ ) {
    for ( i=1; i < ni-1; i++ ) {
      // find indices into linear memory
      // for central point and neighbours
      i00 = I2D(ni, i, j);
      im10 = I2D(ni, i-1, j);
      ip10 = I2D(ni, i+1, j);
      i0m1 = I2D(ni, i, j-1);
      i0p1 = I2D(ni, i, j+1);

      // evaluate derivatives
      d2tdx2 = temp_in[im10]-2*temp_in[i00]+temp_in[ip10];
      d2tdy2 = temp_in[i0m1]-2*temp_in[i00]+temp_in[i0p1];

      // update temperatures
      temp_out[i00] = temp_in[i00]+fact*(d2tdx2 + d2tdy2);
    }
  }
}

int main(int argc, char **argv)
{
  int istep;
  int nstep = 20000; // number of time steps

  // Specify our 2D dimensions
  const int ni = 1024;
  const int nj = 1024;
  float tfac = 8.418e-5; // thermal diffusivity of silver

  float *temp1_ref, *temp2_ref, *temp_tmp, *temp1, *temp2;

  const int size = ni * nj * sizeof(float);

  temp1_ref = (float*)malloc(size);
  temp2_ref = (float*)malloc(size);
  temp1     = (float*)malloc(size);
  temp2     = (float*)malloc(size);

  // Start measuring time
  struct timeval begin, end;
  gettimeofday(&begin, 0);

  // Initialize with random data
  for( int i = 0; i < ni*nj; ++i) {
    temp1_ref[i] = temp2_ref[i] = temp1[i] = temp2[i] = (float)rand()/(float)(RAND_MAX/100.0f);
  }

  // Execute the kernel version
  #pragma acc data copy(temp1_ref[0:ni*nj]) copyin(temp2_ref[0:ni*nj]) deviceptr(temp_tmp)
  for (istep=0; istep < nstep; istep++) {
    step_kernel_OpenACC(ni, nj, tfac, temp1_ref, temp2_ref);

    // swap the temperature pointers
    temp_tmp = temp1_ref;
    temp1_ref = temp2_ref;
    temp2_ref = temp_tmp;
  }

  gettimeofday(&end, 0);
  long seconds = end.tv_sec - begin.tv_sec;
  long microseconds = end.tv_usec - begin.tv_usec;
  double elapsed = seconds + microseconds*1e-6;
    
  printf("%d x %d (%d) %.3f seconds\n", ni, nj, nstep, elapsed);

  free( temp1_ref );
  free( temp2_ref );
  free( temp1);
  free( temp2);

  return 0;
}

In [None]:
!pgcc heat-acc.c -o heat-acc -acc -O3

In [None]:
!./heat-acc

We once again compare the performance obtained with OpenACC against the OpenMP and serial versions. The following table highlights the performance numbers.

| Program Version      | Execution Time (sec.)  | Speedup     |
| :---                 |    :----:              |        ---: |
| Serial               | 9.76                   | 1X           |
| OpenMP T=16          | 0.75                   | 13.01X       |  
| OpenACC              | 0.64                   | 15.25X       | 

## ⊗ CUDA

In [None]:
%%writefile heat-cuda.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include <cuda.h>

// Simple define to index into a 1D array from 2D space
#define I2D(num, c, r) ((r)*(num)+(c))

__global__ void step_kernel_mod(int ni, int nj, float fact, float* temp_in, float* temp_out)
{
  int i00, im10, ip10, i0m1, i0p1;
  float d2tdx2, d2tdy2;

  int j = blockIdx.x * blockDim.x + threadIdx.x;
  int i = blockIdx.y * blockDim.y + threadIdx.y;

  // each thread updates one point in the domain (except boundary)
  if (j > 0 && i > 0 && j < nj-1 && i < ni-1) {
    // find indices into linear memory
    // for central point and neighbours
    i00 = I2D(ni, i, j);
    im10 = I2D(ni, i-1, j);
    ip10 = I2D(ni, i+1, j);
    i0m1 = I2D(ni, i, j-1);
    i0p1 = I2D(ni, i, j+1);

    // evaluate derivatives
    d2tdx2 = temp_in[im10]-2*temp_in[i00]+temp_in[ip10];
    d2tdy2 = temp_in[i0m1]-2*temp_in[i00]+temp_in[i0p1];

    // update temperatures
    temp_out[i00] = temp_in[i00]+fact*(d2tdx2 + d2tdy2);
  }
}

int main(int argc, char **argv)
{
  int istep;
  int nstep = 20000; // number of time steps

  // Specify our 2D dimensions
  const int ni = 1024;
  const int nj = 1024;
  float tfac = 8.418e-5; // thermal diffusivity of silver

  float *temp1_ref, *temp2_ref, *temp1, *temp2, *temp_tmp;

  const int size = ni * nj * sizeof(float);

  temp1_ref = (float*)malloc(size);
  temp2_ref = (float*)malloc(size);
  cudaMallocManaged(&temp1, size);
  cudaMallocManaged(&temp2, size);


  // Start measuring time
  struct timeval begin, end;
  gettimeofday(&begin, 0);

  // Initialize with random data
  for(int i = 0; i < ni*nj; ++i) {
    temp1_ref[i] = temp2_ref[i] = temp1[i] = temp2[i] = (float)rand()/(float)(RAND_MAX/100.0f);
  }

  dim3 tblocks(32, 16, 1);
  dim3 grid((nj/tblocks.x)+1, (ni/tblocks.y)+1, 1);
  
  // Execute the modified version using same data
  for (istep=0; istep < nstep; istep++) {
    step_kernel_mod<<< grid, tblocks >>>(ni, nj, tfac, temp1, temp2);

    // swap the temperature pointers
    temp_tmp = temp1;
    temp1 = temp2;
    temp2= temp_tmp;
  }

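  // Kernel launches are asynchronous: wait for the GPU to finish
  // before stopping the timer
  cudaDeviceSynchronize();

  gettimeofday(&end, 0);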
  long seconds = end.tv_sec - begin.tv_sec;
  long microseconds = end.tv_usec - begin.tv_usec;
  double elapsed = seconds + microseconds*1e-6;
    
  printf("%d x %d (%d) %.3f seconds\n", ni, nj, nstep, elapsed);

  free( temp1_ref );
  free( temp2_ref );
  cudaFree( temp1 );
  cudaFree( temp2 );

  return 0;
}

In [None]:
!nvcc heat-cuda.cu -o heat-cuda

In [None]:
!./heat-cuda

Finally, we compare the performance obtained with OpenACC, OpenMP, and CUDA against the serial version. The following table highlights the performance numbers.

| Program Version      | Execution Time (sec.)  | Speedup      |
| :---                 |    :----:              |        ---:  |
| Serial               | 9.76                   | 1X           |
| OpenMP T=16          | 0.75                   | 13X       |  
| OpenACC              | 0.64                   | 15X       | 
| CUDA                 | 1.43                   | 7X        | 


## Credit

> Credit for the original Heat Conduction CPU source code in this task is given to the article [An OpenACC Example Code for a C-based heat conduction code](http://docplayer.net/30411068-An-openacc-example-code-for-a-c-based-heat-conduction-code.html) from the University of Houston.

## Next

Please continue to the next notebook: [Seismic Modelling - 1D Wave Equation](3-wave.ipynb).